Sound is the main form of communication for both humans and animals. Humans communicate through speech as well as body language. The ear is a remarkable sensory organ that perceives the sensation of sound when exposed to pressure fluctuations about the mean atmospheric pressure. These pressure fluctuations in the surrounding air arise from the disturbance created by any vibrating structure or source. The vibration energy can come from a tuning fork, a guitar string, the vocal cords of a speaker, or anything that vibrates in the frequency range of 16 to 20,000 Hz audible to the listener.1 The ear can easily pick up sound with pressure fluctuations as low as 10⁻⁵ Pa and as high as 10³ Pa. Communication experts call this the ‘dynamic range’; it is of the order of 10⁸, and the ear accommodates it easily as part of hearing.
Complex sound waves are produced by the speaker through the mechanism of speech, which involves the lungs, vocal folds, mouth and nose; these waves are the raw material from which the listener recovers the speaker’s message through hearing. The total process is very complex.
The purpose of this chapter is to review some basic principles underlying the physics of sound, hearing and speech. The complex sound waves we produce as part of speech communication consist of various combinations of buzzes, hisses, pops and so forth. These sounds are modified through a filtering action, by a number of fine adjustments to the movement of the tongue, lips, jaw, soft palate and other articulators, before we recognise them as intelligible language.
It is very interesting to understand the complex steps that occur at the receiving end, where the ear splits this complex sound into its frequency components in much the same way that a prism breaks white light into different optical frequencies, or as a spectrum analyser does in the laboratory. Put differently, a sound picked up by the pinna passes through the external auditory canal (EAC), then the middle ear, the inner ear and the auditory pathway, and finally reaches the auditory cortex (area no. 41) in the brain to produce the sensation of sound.
Hearing, the ear being among the most important of the five senses, plays an important role in the development of speech, communication, cognition, and the emotional and social development of a human being. Hearing impairment sets back the overall development of an individual, so it is essential to identify any impairment in the early stages and treat it effectively. Pleasant sounds such as birds chirping, gushing water and soft music reduce stress, lower blood pressure and give a feeling of tranquility. Noisy sounds, on the other hand, can cause stress, high blood pressure and other ailments.
In any perspective, it is very important to have a good knowledge of how sound is generated, communicated and heard by the human body; this is the core focus of this chapter.
• Fundamentals of sound and relevant terminologies
• Characteristics of sound and its units of measurement
• Physiology of speech along with speech processing and production
• Various aspects of sound interaction with the human body.
Sound energy is a form of mechanical energy that produces the sensation of hearing in our ears. Sound is a travelling disturbance like a ripple produced on the surface of disturbed water in a pond. Any vibrating object produces sound. It reaches us through the vibrations of particles of the medium. A sound wave propagates in the form of a longitudinal wave consisting of compressions and rarefactions of the molecules of the medium, be it air, liquid or solid, in which it travels.
A sound wave travelling in air is a pressure fluctuation about the mean atmospheric pressure that results from a source of vibration. The vibration can come from a tuning fork, a guitar string, the vocal cords of a speaker or virtually anything that vibrates in a frequency range audible to a listener (roughly 16 to 20,000 Hz).1 A sound wave must not only be within the hearing range but also loud enough to be perceived. Mass, elasticity and density of the air medium are the characteristics needed for a sound wave to travel from the source to the receiver. The ear, as the receiver, perceives such pressure fluctuations in the medium, converts them into electrical nerve impulses and transmits these impulses to the brain, which interprets the disturbance as sound. The ear has a finite response time to receive and process the signal: human beings respond to the mean square pressure averaged over about a 35 msec interval.
Although the human ear can hear from 16 Hz to 20 kHz, the human speech frequency range covers only 125 Hz to 8 kHz. Male speech lies between 125 and 4000 Hz, while female speech can go one octave higher, that is, between 125 and 8000 Hz. The audible frequency range is divided into octave bands such that the upper frequency of each band is twice the lower. The frequency of 1000 Hz is recognised internationally as the standard reference, and the mid-frequencies of all octave bands are fixed around it. The human ear responds differently to different frequencies.
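The octave-band scheme described above can be sketched in a few lines of Python: the centre frequencies are generated by doubling and halving from the 1000 Hz reference, and the edges of each band lie a factor of √2 on either side of its centre. The function names here are illustrative, not from any standard library:

```python
import math

REF = 1000.0  # internationally agreed reference frequency, Hz

def octave_centres(n_down=5, n_up=4):
    """Exact octave-band centre frequencies around the 1 kHz reference.

    The familiar nominal values such as 31.5 Hz and 63 Hz are rounded
    forms of the exact centres 31.25 Hz and 62.5 Hz computed here.
    """
    return [REF * 2.0 ** k for k in range(-n_down, n_up + 1)]

def band_edges(fc):
    """Lower and upper edges of an octave band: fc / sqrt(2) and fc * sqrt(2)."""
    return fc / math.sqrt(2.0), fc * math.sqrt(2.0)

print(octave_centres())
# [31.25, 62.5, 125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0, 16000.0]
```

Note that the ratio of the upper band edge to the lower is exactly 2, so the bands tile the audible range without gaps.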
As stated earlier, human ears are sensitive to only a limited range of frequencies from 16 Hz to 20 kHz, which is known as the range of audibility. But this audibility range reduces with age as the hearing sensitivity of ears falls for both low and high frequencies, leading to gradual hearing loss.
The sound frequencies below 20 Hz are known as infrasonic sound while frequencies more than 20,000 Hz are known as ultrasonic sound. Both the infrasonic and ultrasonic frequencies are inaudible to human ears. But some animal species such as elephants, bats, whales and so forth can hear both infrasonic and ultrasonic sounds.
When sound waves strike a hard surface, they are reflected back into the same medium, obeying the laws of reflection, that is,
a. The angle of reflection is equal to the angle of incidence
b. The incident ray, reflected ray and normal at the point of incidence all lie in one plane.
Unlike light waves, sound waves do not require a smooth and shining surface for reflection; they can be reflected from smooth and rough surfaces alike. The only criterion for sound to be reflected is that the reflecting surface must be larger than the wavelength of the sound waves. This phenomenon is utilised in megaphones, soundboards, ear trumpets and so forth.
The sound heard after reflection from a distant object or obstacle, such as a cliff or the wall of a building, after the original sound has ceased is called an echo. This concept is used in medical applications such as ultrasound scanning of different parts of the body (ultrasonography); echocardiography, for example, gives a graphic outline of the heart’s movement and pumping action. Ultrasound energy is also used to break kidney stones.
Animals have different ranges of audible frequency. The upper audible limits of bats, dolphins and dogs are much higher than that of human beings. Bats can produce and detect sound of very high frequency, up to about 100 kHz, and can locate obstacles using echoes so that they fly safely without colliding with them. This process of detecting obstacles is known as sound navigation and ranging. Dolphins, whales and other marine animals use ultrasonic waves underwater for navigation, communication and hunting prey as part of their existence in the oceanic environment. Besides medical applications, ultrasound has innumerable uses such as ultrasonic cleaning, welding, stitching, cutting, and repelling animals like dogs, rats, bats and birds.
Audible sound can be characterised in two ways. One is through objective characterisation using measurable quantities such as sound pressure level (SPL), frequency and sound intensity. The second is through subjective characterisation using non-measurable human perceptions such as loudness, pitch and timbre (quality). In musical acoustics, we often use terms like pitch, loudness and timbre to describe musical quality on the basis of our perception, in a subjective manner; this may differ from person to person. But in medical acoustics, when one wants to use ultrasound to break kidney stones, we use measurable quantities such as sound intensity, frequency and SPL to characterise the sound objectively. The values set for these quantities remain the same irrespective of the operator working with the equipment in a given situation.
The human ear accommodates a very wide range of sound pressure fluctuations, of the order of 10⁸ (the dynamic range mentioned earlier). This forces us to use a logarithmic unit, the decibel, for the measurement of sound levels, since linear units cannot conveniently accommodate such large ranges. The letter B in dB is always a capital letter because it commemorates Alexander Graham Bell, the inventor of the telephone.
We can use the dB scale to represent sound power, sound intensity and sound pressure as given below:
i) Sound Power Level, LW = 10 log10 [W/Wref] dB, where Wref = 10⁻¹² W
ii) Sound Intensity Level, LI = 10 log10 [I/Iref] dB, where Iref = 10⁻¹² W/m²
iii) Sound Pressure Level, LP = 10 log10 [P²/P²ref] dB, where Pref = 20 × 10⁻⁶ N/m²
Pref = 20 × 10⁻⁶ N/m² (20 µPa) is used because it corresponds to the threshold of human hearing.
Similarly, Iref = 10⁻¹² W/m² represents the threshold of hearing (plane wave approximation).
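These level definitions translate directly into a short calculation. A minimal Python sketch using the reference values above (the function names are illustrative):

```python
import math

P_REF = 20e-6   # reference pressure: 20 uPa, threshold of hearing
I_REF = 1e-12   # reference intensity: 1e-12 W/m^2 (threshold, plane wave)

def sound_pressure_level(p_rms):
    """L_P = 10 log10(p^2 / P_REF^2) = 20 log10(p / P_REF), in dB."""
    return 20.0 * math.log10(p_rms / P_REF)

def sound_intensity_level(intensity):
    """L_I = 10 log10(I / I_REF), in dB."""
    return 10.0 * math.log10(intensity / I_REF)

print(sound_pressure_level(20e-6))  # 0.0 dB: the threshold of hearing
print(sound_pressure_level(1.0))    # ~94 dB: 1 Pa, a common calibrator level
print(sound_intensity_level(1.0))   # ~120 dB: 1 W/m^2, near the pain threshold
```

Because intensity is proportional to pressure squared, the factor of 10 in front of the logarithm becomes 20 when the ratio is expressed in pressures rather than intensities.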
Sound power gives an idea about the power capacity of the sound source which causes the disturbance. The propagation of such disturbance in the medium can be measured in terms of sound intensity. The term ‘sound pressure’ is used to explain the loading pressure of such disturbance in the ear as an effect. Sound power and sound pressure are scalar quantities whereas sound intensity is a vector quantity.
As mentioned earlier, the human ear responds differently to sounds of different frequencies. Extensive audiological surveys have resulted in the introduction of weighting networks called A, B and C for sound levels below 55 dB, between 55 and 85 dB, and above 85 dB respectively. For example, if somebody is exposed to machinery noise with a sound pressure level of 70 dB (ref 20 µPa) measured using the A-weighting network, we write it as 70 dB(A) or 70 dBA.
For practical reasons, measurement of sound pressure is far more common than measurement of sound intensity. Hence, a decibel formula that makes use of pressure ratios rather than intensity ratios was derived. This derivation was based on the fact that intensity is proportional to pressure squared. The standard reference for the dB SPL scale is 20 µPa. However, any level can be used as a reference as long as it is specified.
The dBHL scale, used widely in audiological assessment, was developed specifically for measuring sensitivity to pure tones of different frequencies. The reference used for the dBHL scale is the threshold of audibility at a particular signal frequency for the average, normal-hearing listener. In particular, the ear is more sensitive to mid-frequency sounds between 1000 and 4000 Hz than to lower and higher frequencies. The complex shape of this threshold-of-audibility curve provides the underlying motivation for the dBHL scale.
Noise is defined as ‘unpleasant’, ‘disagreeable’ or ‘undesired’ sound. What is sound to one person can very well be noise to somebody else. Intense noise is now well recognised as a serious health hazard of modern civilisation. Heavy industry, high-speed machinery, increased vehicular traffic, urbanisation and so forth have created a multitude of noise sources, which have accelerated noise-induced hearing loss among the exposed population.
India has adopted the international standard limit of 90 dBA during an 8-hour shift for workers. Limits for durations other than 8 hours are provided in Table 3.1.
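Occupational limits of this kind usually follow an ‘exchange rate’ rule, under which the permitted duration halves for every fixed increment above the 90 dBA / 8-hour criterion. The sketch below assumes a 3 dB exchange rate (as in ISO 1999; other conventions such as OSHA’s use 5 dB) purely for illustration; Table 3.1 remains the authoritative source for the Indian limits:

```python
def permissible_hours(level_dba, exchange_rate=3.0,
                      base_level=90.0, base_hours=8.0):
    """Permitted daily exposure duration, in hours.

    The duration halves for every `exchange_rate` dB above the
    90 dBA / 8-hour criterion (exchange rate is an assumption here).
    """
    return base_hours / 2.0 ** ((level_dba - base_level) / exchange_rate)

print(permissible_hours(90))  # 8.0 hours at the criterion level
print(permissible_hours(93))  # 4.0 hours with the assumed 3 dB exchange rate
print(permissible_hours(99))  # 1.0 hour
```

Swapping in `exchange_rate=5.0` reproduces the more lenient 5 dB convention, under which 95 dBA is permitted for 4 hours.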
Sound pressure level is the strength of a sound, measured on the dB scale with reference to 20 µPa, typically at a distance of 1 m from the source. Listed below are some SPLs of common occurrence:
• Whisper – 30 dBA
• Normal conversation – 60 dBA
• Shout – 90 dBA
• Discomfort of the ear – 120 dBA
• Pain in the ear – 130 dBA
Constant exposure to sound at a level above 100 dB can cause headaches and permanent hearing damage. The safe range of sound levels for hearing is 0 to 80 dB.
The Ministry of Environment and Forests (MOEF) of the Government of India, on the advice of the National Committee for Noise Pollution Control, has been issuing Gazette notifications prescribing noise limits as well as rules for regulation and control of noise pollution in urban environment. These are summarised below:
The Noise Pollution (Regulation and Control) Rules, 2000:2
These rules make use of limits as indicated in Table 3.2 for the ambient noise that is allowable.
i. A loudspeaker or a public address system shall not be used except after obtaining written permission from the authority.
ii. A loudspeaker or a public address system or any sound-producing instrument or a musical instrument or a sound amplifier shall not be used at nighttime except in closed premises for communication within, like auditoria, conference rooms, community halls, banquet halls or during a public emergency.
iii. The noise level at the boundary of the public place, where a loudspeaker or public address system or any other noise source is being used, shall not exceed 10 dB(A) above the ambient noise standards for the area (see Table 3.2) or 75 dB(A), whichever is lower.
iv. The peripheral noise level of a privately owned sound system or a sound-producing instrument shall not, at the boundary of the private place, exceed by more than 5 dB(A) the ambient noise standards specified for the area in which it is used.
v. No horn shall be used in silence zones or during nighttime in residential areas except during a public emergency.
vi. Sound-emitting firecrackers shall not be burst in silence zone or during nighttime.
vii. Sound-emitting construction equipment shall not be used or operated during nighttime in residential areas and silence zones.
The terms ‘intensity’ and ‘pressure’ denote objective measurements that relate to our subjective experience of the loudness of sound. Intensity, as it relates to sound, is defined as the power carried by a sound wave per unit of area, expressed in watts per square meter (W/m²). Power is defined as energy per unit time, measured in watts (W). Power can also be defined as the rate at which work is performed or energy converted. Pressure is defined as force divided by the area over which it is distributed, measured in newtons per square meter (N/m²) or, more simply, Pascals (Pa).
In relation to sound, we specifically look at air pressure amplitude which is measured in Pascals. Air pressure amplitude caused by sound waves is measured as a displacement above or below equilibrium atmospheric pressure.
The greater the intensity or pressure created by the sound waves, the louder is the sound. However, loudness is only a subjective experience – that is, it is assessed by an individual saying how loud the sound seems to him or her. The relationship between loudness and air pressure is not linear. One cannot assume that if the pressure is doubled, the sound seems twice as loud. In fact, it takes about 10 times the change in the air pressure for a sound to seem twice as loud. It is like saying that 10 mixers in the kitchen can produce a sound level twice as loud as one mixer unit.
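The ‘10 times the power for twice the loudness’ observation follows from the decibel arithmetic: ten identical, uncorrelated sources raise the level by 10 log10(10) = 10 dB, while merely doubling the number of sources adds only about 3 dB. A small illustrative sketch:

```python
import math

def level_increase_from_sources(n):
    """Increase in sound level (dB) when n identical, uncorrelated
    sources replace a single one: 10 * log10(n)."""
    return 10.0 * math.log10(n)

print(level_increase_from_sources(2))   # ~3 dB: a second mixer barely registers
print(level_increase_from_sources(10))  # 10 dB: ten mixers, roughly "twice as loud"
```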
Another subjective perception of sound is pitch. The pitch of a note is how ‘high’ or ‘low’ the note seems to an individual. The related objective measure of pitch is frequency. In general, the higher the frequency, the higher is the perceived pitch. But once again, the relationship between pitch and frequency is not linear. Also, our sensitivity to frequency differences varies across the spectrum, and our perception of the pitch depends partly on how loud the sound is. A high pitch can seem to get higher when its loudness is increased, whereas a low pitch can seem to get lower. Context matters as well in that the pitch of a frequency may seem to shift when it is combined with other frequencies in a complex tone.
Each sound has a specific tonal quality which is called timbre. It makes the sound produced by any instrument different from others, even if they play the same note. For example, a guitar and piano can play the same note simultaneously but still sound different because of their unique tones. The uniqueness comes because of several harmonics and overtones which are produced simultaneously along with the fundamental.
The sound perceived by the listener is directly related to the physical characteristics of the sound wave. For a given frequency, the greater the pressure amplitude of a sound wave, the greater is the perceived loudness.
One of the important factors is that the ear is not equally sensitive to all frequencies in the audible range. A sound at one frequency may seem louder than one of equal pressure amplitude at a different frequency.
Speech is a special ability possessed by human beings and acquired along the path of evolution. Cognition-enabled speech with multilinguistic ability helps in communication. Speech is the product of sound production involving the brain, the peripheral nervous system and the aerodigestive tract.
When people talk to each other, a great deal of activity happens in their brains, which are storehouses of all information about the language they are using for speech communication.
The speech centres of the brain carry information about the phonology of the language, its intonation and rhythmic patterns, the grammatical and syntactic procedures which govern speech, and a very extensive vocabulary. These are strung together into communicative language by neuromuscular instructions sent to the different muscles involved in producing the desired speech.
This is generally explained using the well-accepted source-filter model, in which voice production involves two processes: a source of sound produced from air ejected by the lungs, and the conversion of this sound into intelligible speech by the vocal-tract filter. To explain further, the larynx produces a sound whose spectrum contains many different frequencies. Then, using the articulators, that is, the tongue, teeth, lips, velum and so forth (Figure 3.1), the raw sound spectrum is modulated to carry language, phonology and tonal quality, making it intelligible voice information for the listener.
The air expelled from the lungs carries sufficient sound energy. At the larynx, this airflow passes between the vocal folds and through the vocal tract with its constrictions. A set of perfect vocal folds has the following characteristics:
• Being open about half the time and closed about half the time in a vibratory cycle
• Letting air out in between the vocal folds in measured puffs
• No air leakage during the closed phase
• Vibrating symmetrically.
Loudness and pitch of the voice depend on lung capacity as well as on the anatomy of the vocal folds, such as their thickness, mucosal wave pattern, tension, mass and movement during phonation.
In voiced speech, the vocal folds vibrate while allowing puffs of air to pass, producing modulated sound. The modulated sounds produced in voiced speech usually contain a set of related frequencies called harmonics. Harmonics are integer multiples of the fundamental frequency, which is the frequency of vibration of the vocal folds.
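The harmonic series is simple to enumerate. A brief sketch, taking a male fundamental of 125 Hz as an assumed example value and stopping at the 8 kHz upper edge of the speech range:

```python
def harmonics(f0, fmax=8000.0):
    """Integer multiples of the fundamental frequency f0, up to fmax
    (8 kHz, the upper edge of the human speech range)."""
    return [f0 * k for k in range(1, int(fmax // f0) + 1)]

# Assumed example: a typical male fundamental of 125 Hz
print(harmonics(125.0)[:5])  # [125.0, 250.0, 375.0, 500.0, 625.0]
```

The vocal-tract filter then emphasises some of these harmonics over others, which is what distinguishes one vowel from another.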
In whispering, the folds do not vibrate but are held close together with a small gap. When air moves through a small gap, the airflow becomes turbulent, with air vortices; otherwise the flow would have been laminar. The transition between the two regimes is characterised by the Reynolds number. A turbulent flow produces a characteristic sound comprising a mixture of very many frequencies. Sound consisting of many frequencies is called broadband sound, which is characteristic of whispering.
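The Reynolds number mentioned above can be estimated for airflow through a narrow glottal gap. The sketch below uses standard properties of air, but the jet velocity and gap width are assumed, illustrative values only; it simply shows that a fast jet through a millimetre-scale constriction can reach Reynolds numbers well above the commonly quoted laminar range:

```python
def reynolds_number(velocity, length, density=1.2, viscosity=1.8e-5):
    """Re = rho * v * L / mu for air.

    density   -- kg/m^3 (about 1.2 for air at room temperature)
    viscosity -- Pa.s   (about 1.8e-5 for air)
    Channel flow is roughly laminar below Re ~ 2000 and increasingly
    turbulent above it.
    """
    return density * velocity * length / viscosity

# Assumed, illustrative glottal values: a ~30 m/s jet through a ~3 mm gap
print(reynolds_number(30.0, 3e-3))  # ~6000: well into the turbulent range
```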
This analysis tells us that speech sounds are of two classes: voiced sounds, produced by the vibration of the vocal folds, and unvoiced sounds which are produced by other effects, such as whispering. Unvoiced sounds are also present in normal speech. This being an introductory chapter on this complex subject, the authors do not want to go into more details. Interested readers can refer to advanced works on this subject.
Frequency – It is the number of cycles per second. The unit of frequency is Hertz (Hz), named after the German scientist Heinrich Rudolf Hertz. A sound of 1000 Hz means 1000 cycles per second.
Decibel (dB) – It is 1/10th of a Bel and is named after Alexander Graham Bell, the inventor of the telephone. It is not an absolute figure but represents a logarithmic ratio between two sounds, namely the sound being described and the reference sound.
Pure Tone – A single frequency sound is called pure tone.
Complex Sound – Sound with more than one frequency is called complex sound.
Pitch – It is a subjective sensation produced by the frequency of sound. The higher the frequency, the higher the pitch.
Overtones – A complex sound has a fundamental frequency, that is, the lowest frequency at which the source vibrates. All the frequencies above that tone are called overtones. These determine the quality, or timbre, of the sound.
Voice – Voice, sometimes called ‘vocalization’, usually refers to sounds that are produced at the laryngeal vocal folds, mostly through the air from the lungs. In humans, vocalizations comprise the fundamental components of speech, but not all vocal sounds are part of the speech spectrum. Indeed, we utter many involuntary sounds from cough to an infant babble that are generated at the vocal folds but are not morphed into articulated sounds of speech.
Phonation – The term that describes the production of the voice via vocal fold modifications. The essence of voice depends on vocal fold anatomy, physiology and neuromuscular control.
Speech – It is the verbal vocal communication.
Language – It is the global expression of human communication – spoken, written or gestural, consisting of words in a structured, ordered and conventional manner.
Most definitions have been internationally standardised and are listed in standards publications such as IEC 60050-801 (1994).
• Sound energy is a form of mechanical energy produced by a vibrating object which creates a disturbance in the particles of a medium in which it travels.
• Sound energy travels through a medium in the form of longitudinal waves consisting of compression and rarefactions.
• Human ears are sensitive to the 16 Hz to 20,000 Hz frequency range, which is known as the audible frequency range.
• Human speech frequency ranges between 125 Hz and 8 kHz.
• Sound frequencies below 20 Hz are known as infrasonic sound. Frequency above 20,000 Hz is known as ultrasonic sound. Both are inaudible to human ears.
• Sound is described objectively using measurable quantities such as sound pressure level (SPL), frequency and intensity.
• Subjective description of sound is non-measurable and it depends on human perception. It is described in terms of loudness, pitch and timbre.
• Speech is the product of sound initiated in the brain, coordinated by the peripheral nervous system and produced by the aerodigestive tract. It is further modified in the oral cavity by articulators to convert into sensible voice information which can be understood by the listener.
The benefits of music on our mental health are much better understood today than they were even a decade ago. For centuries, music has been touted as beneficial in stressful situations, but it is only recently that the study of music as a therapeutic has taken a structured and scientific approach. Thanks to the latest imaging and diagnostic techniques, how sound, and in particular music, impacts stressors has become clearer.
In the 21st century, and even more so over the last few decades, the relevance of the effect of music on our well-being has risen greatly, and it has never been more important than it is now. This can be attributed to the fact that access to music today is as ubiquitous as access to the internet. For example, the channel with the largest subscriber base on YouTube, the world’s largest video streaming platform with over a billion daily users, is T-Series, an Indian music label that boasts the world’s biggest collection of Indian film songs.1
The internet has played a key role in introducing audiences to a global pool of music. A simple Google search with keywords such as ‘calming music’ or ‘soothing music’ will give a plethora of options including playlists of Buddhist chants, Chinese flutes and even the sounds of rainfall and the rustle of leaves in the wind. Leading music streaming providers now have dedicated playlists that are regularly curated by experts and focus on putting the listener in a better frame of mind. Even mental health applications such as Calm and Headspace use soothing background sounds and music during their online sessions. The relevance of music as a digital therapeutic can be directly tied back to the growth in audio recording and distribution industry. To understand the scope of music as a digital therapeutic in the age of music streaming and on-demand availability, it will be helpful to first understand the history of music recording and distribution.
This chapter aims to give the reader an insight into the ubiquity of music and its effects on finding new use cases for music – namely, as a digital therapeutic. Music has been known as a mood-changer for centuries, but the scientific method is now helping us channel music’s power to help tackle some of our biggest challenges.
• The history and science of music recording
• The role of internet in transforming music distribution
• The introduction of music as a part of the experience by new-age internet businesses
• How music has become a digital therapeutic
• How music as a digital therapeutic is helping to solve difficult problems
Prior to music recording, live performances were the only way for audiences to experience music. Such performances were generally community events, with the music performance being either the main attraction or a part of the ambience such as during a wedding. In the 17th and 18th centuries, live bands became increasingly common as background music became a standard feature of social gatherings.

The invention of sound recording has been attributed to Thomas Alva Edison, who invented the phonograph in 1877.2 However, the first audio recording device was actually invented by the lesser-known French scientist Édouard-Léon Scott de Martinville.3 Scott’s device was patented in 1857, a good two decades before Bell’s telephone and Edison’s phonograph. Scott’s idea behind the phonautograph was to build a device that would set sound to paper, just like the camera set an image to paper. A phonautograph is almost like a seismograph for general audio. Scott built his new device to mimic the workings of the human ear – a vibrating membrane was connected to a stylus, which would trace the movement of the membrane on a piece of paper. With an incoming sound, the membrane would vibrate and the stylus would capture this wave on a piece of paper. With training, a person could theoretically ‘read’ the sound wave and make sense of it. This was, however, much easier said than done. Over many iterations, Scott improved the accuracy of his traces. Incredibly, it did not occur to him to play this sound back into the air!

Enter Edison and his phonograph in 1877. It would be fair to say that Edison was the first to both record and reproduce sound. He applied for a patent for his invention in December 1877,4 and in typical Edison style, his invention caused a significant buzz in the scientific world. While the theory of how to record and reproduce sound was sound (pun intended), the phonograph needed significant improvements to become mainstream and commercially viable.
With Edison focusing his attention on other projects, the Volta Laboratory in Washington DC, headed by another prolific inventor, Alexander Graham Bell, pushed the boundaries of this new technology and invented what would eventually become the dictaphone and the gramophone. Edison, who was focusing his energies and efforts on building New York City’s light and power system, had sold his patent to a company owned by Gardiner Green Hubbard. Hubbard was no stranger to Bell – Bell was his son-in-law, and Hubbard succeeded in enticing the inventor of the telephone to take up the challenge of improving Edison’s invention, which was not ready for mass adoption. In the next 10 years, Bell and his associates at the Volta Laboratory successfully solved many of the technical challenges that halted the adoption of the Edison phonograph. The inventor group also went on to file patents and incorporate companies to license their technology, and in essence created the market for sound recording and sound playback instruments.

The ‘acoustic era’ of sound recording, which lasted for close to 50 years from the invention of the phonograph, was characterised by the mechanical nature of recording and playback of sound. A paradigm shift occurred in 1925, when Western Electric introduced electronic microphones, amplifiers and signal recorders, and this marked the end of the acoustic era and the beginning of the ‘electronic era’ of sound recording. The electronic era of sound recording is significant because it allowed for the post-processing and editing of sounds for the first time. Sound amplitude could now be adjusted appropriately. This meant that noise could also be reduced in recordings – a major problem that the original dictaphones could not solve. With the frontier of accurate voice recording passed, the electronic era of recording opened up the space of music recording and created a market for music distribution.
Like almost all early technologies, sound and music recording had its initial problems – there were almost no standards, discs were fragile and reproduction quality degraded precipitously. Recordings were also grainy and unclear, which significantly reduced adoption. Finally, even with mass manufacturing, the cost of buying a record player and discs was not small, which kept this marvelous innovation available to only the more affluent sections of society.
This changed with the emergence of the magnetic tape, which ushered in the ‘magnetic era’ of sound recording. The electronic era had already solved for sound quality, and the magnetic era now solved for the robustness and the cost of the distributed material. Magnetic tapes were significantly cheaper to procure and use as storage, and also lasted longer than their predecessors. With great sound quality and an accessible storage media, the music recording business exploded. For the first time in history, recorded music royalties outstripped earnings from live shows. Improving quality and decreasing storage and distribution cost is a time-tested method to build any market and industry. In this case, technology created a recording and distribution industry, which led to the creation or forced the evolution of many other industries. For example, man’s ability to record and distribute sound and make it accessible to millions led to a complete overhaul of our understanding of copyright and intellectual property.
If there is one rule of business, it is that no industry, however efficient, is safe from disruption through technology. In 1989, Tim Berners-Lee developed the file sharing protocols and standards that eventually became the World Wide Web. And with the creation of the mp3 audio standard, the magnetic era was also officially over. Like pretty much everything else in our lives today, audio and music became digital goods. For over two decades, however, they were consumed physically, usually in the form of CDs. As core internet infrastructure became better and people got faster access to the web, the digital aspect of music started taking over. Mp3 allowed audio files to be downloaded freely off any server, and music distribution, once again, played a significant part in redefining our understanding of intellectual property. While the famous (or rather infamous) Napster v Metallica5 case did put a brake on digital music distribution, on-demand access to music was an idea whose time had come.
A decade ago, one had to buy physical copies of an album, buy it off iTunes or risk one’s system by illegally downloading music (Bollywood music fans would be all too familiar with sites such as songs.pk). Today, for an affordable sum, or in exchange for listening to some creative audio advertisements, people have access to the whole world’s music literally at the tips of their fingers. The music streaming industry has seen double digit year-on-year growth for the past four years, and services such as Spotify, Apple Music, YouTube music and Gaana are household names. Hidden in these numbers, however, is an interesting observation. The fastest growing sub-segment in terms of revenue in music streaming is not direct-to-consumer, but in-service streaming. In-service streaming is when a piece of music is streamed as a part of a larger experience being delivered to an end consumer. For example, fitness service Peloton streams music as a part of an online workout. The popular meditation and mental wellness app Calm streams music to their users as a part of their guided mindfulness and therapy sessions. For these services, and for their millions of users around the world, music is an integral part of an experience that helps them feel better. While we never needed empirical proof of music’s ability to put us in the right frame of mind, these examples give us proof in a metric that is possibly even more valid than the results of a double-blind study: money.
Digital therapeutics is one of the hottest fields in medicine today. Mobile apps and video games, hitherto thought of as play activities, have recently gained traction in helping people cope and recover from severe ailments.6 With software and new technologies touching every part of our lives, and with sensors embedded into pretty much everything we touch, even delivering care is going digital. The use of VR to treat patients suffering from PTSD is a perfect example of a new-age digital therapeutic. While digital therapeutics is an umbrella term for any software intervention in the care process, we will particularly focus on the use of music as a digital therapeutic to show its true potential.
Using functional neuroimaging (fMRI), it has been shown that music affects the brain by triggering signal activity in specific parts of the brain.7 This clearly supports the case for music being used as a therapeutic and not just an auxiliary in the course of treatment. Streaming technology has allowed us to insert our choice of music into specific contexts, and this has enabled us to dynamically measure the impact of different types of music in different situations. One of the most studied situations in this regard is the use of music in treating depression. Estimates say that over 300 million people suffer from clinical depression, and music therapy could potentially offer an alternative to strong chemical interventions. Studies8, 9, 10 have already proven music's role in our mental wellness to be statistically significant, but how far are we from experiencing the benefits ourselves? Imagine a situation where, depending on your mood, your music streaming provider picks out a playlist that has been clinically validated using functional imaging studies and proven to improve your state of mind and wellness. Sounds incredible, right? It is. And it is not far from reality – music streaming giant Spotify is already investing heavily in music and sound therapy that will automatically pick a curated playlist based on your mood.
A recent study published in Frontiers in Psychology11 underscored the power of music to positively affect our well-being by citing a multi-country study done during the first wave of the COVID-19 pandemic. The study focused on the effect of music on possibly the most stressed demographic in the world at the time – healthcare workers. The findings showed that music promoted emotional well-being for hospital clinical staff in Italy by reducing their feelings of fear, sadness and worry. In Australia, a positive association was found between music listening and life satisfaction. Other research done during the pandemic, combining data on both music listening and music playing, showed that music was considered the most helpful coping activity in the United States, Italy and Spain.12
The effect of our better understanding of music as a drug and its near-universal availability means that it is now the drug of choice for many. Mental health app sessions have grown 66% year on year and the expected valuation of global mental health apps went up 27% since 2019. The use of music for coping with anxiety and depression, both on and off these platforms, has also grown proportionately. In addition to being the drug of choice for many, music is quickly becoming a key differentiator, with the ability to increase both engagement and efficacy for digital wellness and mental health platforms. As mentioned earlier, leading meditation app Calm is quickly becoming an established music label, leveraging direct artist relationships and competing head-to-head with major record companies in the music and wellness arena.
An area of cutting-edge research where music therapy is showing huge promise is stroke rehabilitation. Sometimes you just can't help but move to the beat of a song, right? It turns out that Brian Harris, a certified music therapist based in Boston, is using exactly this phenomenon to help patients recover their limb movements after a stroke. Harris specialises in Neurologic Music Therapy (NMT). NMT involves using music and feedback to train the brain during physical therapy. According to Harris, many of his patients have shown significantly improved mobility post NMT. NMT works because of an underlying phenomenon called Auditory-Motor Entrainment. In essence, the subconscious mind has a mapping between the auditory system and the motor system. When we hear music, the two systems work in sync, just like two wheels connected by an axle, and we start walking to the rhythm of the music. A team at Stony Brook University found in 2019 that listening to music, combined with feedback based on their gait, helped people with Parkinson's disease walk better. Scientists in the field are excited because they believe we are just scratching the surface. A start-up called MedRhythms is already making waves, having raised USD 25 million; they have also tied up with Universal Music Group, the world's largest record label, and are gearing up to have their music-based therapeutics submitted to the FDA for approval. If all goes according to their plan, theirs could be the first-ever prescription music!
While music's potential for positive impact in our lives is clear, it would also be prudent to mention the potential downsides. There is a body of research which shows that certain types of sounds and music can lead people to be more aggressive and encourage crime. A study in the UK explored 'drill music' – music that has threatening lyrics – and its correlation to attention-seeking crime. In India, the late Sidhu Moose Wala, a popular gun-toting hip-hop and pop artist, often linked his power and influence to his armaments. His music was recently banned for enticing youth to pick up arms, and sadly he became a victim of gun violence himself. While these incidents are quite common, there is a case to be made that such music, be it drill music or Moose Wala's songs on the supremacy of gun-wielding Punjabis, is the consequence of a rise in crime, poverty, deprivation, inequality and lack of access to resources, and not the cause. It would be wise to listen to Eminem, the world's best-selling hip-hop artist, when in his hit song 'Sing for the Moment' he asks rhetorically, 'They say music can alter moods and talk to you / But can it blow a gun up and cock it too?'
Music as a drug will soon be a scientific reality. While it would need that stamp of approval to fit into our definition of a scientific object, music has been a getaway drug for billions throughout centuries. One could go as far as to say that music is one of the essential ingredients of life. For if there was none in my life, I’m not sure it’d be worth going on.
So I say
Thank you for the music, the songs I’m singing
Thanks for all the joy they’re bringing
Who can live without it? I ask in all honesty
What would life be?
Without a song or a dance, what are we?
So I say thank you for the music
For giving it to me1
The tremendous ubiquity of music can be attributed to the incredible advances in technology for recording, storage and distribution. It is this same characteristic of music today, its ubiquity, that allows us to find new use cases for music. While music has been used as a mood-changer for centuries, it is only recently that we have been able to scientifically understand its impact on our minds and our moods. Music as a digital therapeutic has shown many early successes and has the potential to become a part of standard medical prescriptions.
Hearing and speech are perhaps two of the most valuable gifts of nature to mankind. Though the perception of sound is important to every animal, its importance for man is unique in the sense that language and communication, to which the whole development of the human race is indebted, depend solely on the capacity to hear sounds and to communicate by speech.[1] Though in early life the perceptive frequency range of man is 20 Hz (cycles per second) to 20,000 Hz, this range narrows down to a great extent as age advances. By the age of 50, the vast majority do not appreciate sounds above 8000 Hz. This decline is marked, and happens much earlier, in persons with a history of chronic noise exposure and co-existing lifestyle diseases such as diabetes mellitus.[2]
The intensity range of human hearing is so huge that it is expressed as a logarithmic ratio. If we take the minimum threshold of hearing of a healthy person as 1 (logarithmic value zero), the maximum intensity one can perceive without pain is 10 raised to the power 14 times greater (140 dB, or 100 trillion times the minimum threshold of 0 dB).
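The decibel arithmetic above can be checked in a few lines of Python. This is a minimal sketch using the standard 10·log10 intensity formula, with the 0 dB threshold of hearing taken as the reference:

```python
import math

def intensity_to_db(ratio):
    """Convert an intensity ratio (relative to the 0 dB threshold) to decibels."""
    return 10 * math.log10(ratio)

def db_to_intensity(db):
    """Inverse: how many times the threshold intensity a given dB level is."""
    return 10 ** (db / 10)

print(intensity_to_db(1))      # threshold of hearing -> 0.0 dB
print(intensity_to_db(1e14))   # pain threshold -> 140.0 dB
print(db_to_intensity(140))    # 1e14, i.e. 100 trillion times the threshold
```

The logarithm is what lets a single 0–140 scale span fourteen orders of magnitude of intensity.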
• Understanding anatomy of the ear (external, middle and inner)
• Nerve connections of the inner ear to the brain
• How the external and middle ear compensate for the loss of sound energy when the sound travels from the air medium to the fluid medium of the inner ear
• Mechanism of converting sound energy into electrical impulses by the hair cells inside the inner ear (cochlea)
• Final understanding of speech by the cerebral cortex
• Sound production (phonation) by the larynx
• Modulation of the sound from the larynx by throat and mouth (resonance and articulation) to produce intelligible speech.
When sound waves, which are indeed pressure waves, reach the tympanic membrane (TM), they cause vibrations of TM’s moving part or what is commonly called the “ear drum”. These vibrations from the TM are transmitted to the oval window (OW) which is situated in the medial (inside) wall of the air-filled middle ear (ME) where the smallest bone in the body, stapes, is attached. Vibrations from the TM are transmitted through the three small bones, namely malleus, incus, and stapes. (Figure 5.1)[ 3]
The stapes bone, one of the ossicles connecting the TM to the inner ear, works like a piston that moves in and out of the inner ear (cochlea). The cochlea is filled with a fluid namely perilymph. The amplitude of these vibrations depends on the loudness of the sound, and the frequency depends on the pitch of the sound.
Inside the cochlea, lying in the perilymph is a tubular structure, the membranous cochlea, which is filled with another special fluid, the endolymph. This endolymph is the only extracellular fluid (fluid outside the cells) that has more potassium ions than sodium, whereas all other extracellular fluids have the reverse.
The specialized organ which converts sound energy into electrical impulses (action potential ) is inside this membranous cochlea of the inner ear and is known as the “organ of corti”, which has specialized sensory cells called “hair cells” (HC). Hair cells have long hair-like structures called “kinocilium” (long) and “stereocilium” (short) (Figure 5.2)[ 4]
Normally, when sound vibrations travel from air to a fluid, most of the energy is reflected back and only about 1/30th of the energy passes into the fluid medium. This resistance to sound vibrations entering a fluid medium is called "impedance". The same resistance faces sound waves entering the inner ear fluid. To overcome this disadvantage and to pass maximum sound energy into the inner ear fluid, we have the special structure of the middle ear, the TM and its ossicular chain (OC), which compensates for this loss of energy to a great extent. This mechanism of overcoming the resistance is called "impedance matching" (compensating the impedance).[6]
It is to be noted that the total vibrating area of the TM is 54.2 mm², which is 17 times more than the surface area of the OW (3.2 mm²), through which the sound vibrations enter the inner ear fluids. Thus, the energy or force reaching 54.2 mm² of TM is focused on 3.2 mm² of OW; in other words, the pressure reaching the OW per sq mm is 17 times more than that reaching the TM. Also, the ossicles have a lever effect of 1.3 times, because the malleus handle, which is connected to the TM, is 1.3 times longer than the long process of the incus, which is connected to the stapes. Though this lever effect reduces the amplitude of the movement of the stapes, the force increases 1.3 times. The product of the TM:OW area ratio (17) and the lever ratio (1.3) comes to about 22, which is the total force amplification at the OW. Even though this amplification does not fully compensate for the 30-fold loss of energy as the sound vibration travels from air to the inner ear fluid (the perilymph), this middle ear mechanism compensates for the loss (impedance) to a great degree.[6]
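The middle-ear amplification arithmetic is easy to verify. This sketch simply multiplies out the area and lever figures quoted above:

```python
# Middle-ear pressure amplification, using the figures quoted in the text.
TM_AREA = 54.2     # effective vibrating area of the tympanic membrane, mm^2
OW_AREA = 3.2      # area of the oval window footplate, mm^2
LEVER_RATIO = 1.3  # malleus handle length / incus long process length

area_gain = TM_AREA / OW_AREA         # ~17x pressure gain from area focusing
total_gain = area_gain * LEVER_RATIO  # ~22x total force amplification at the OW

print(round(area_gain))   # 17
print(round(total_gain))  # 22
```

Concentrating the same force onto a surface 17 times smaller, plus the 1.3× lever, is what produces the roughly 22-fold gain.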
If there is any disruption of the middle ear or ossicular chain, or a large perforation of the TM, this middle ear mechanism does not work and the force advantage is lost, with only a small part of the sound energy reaching the inner ear. The person thus develops hearing loss, called "conductive deafness" because the problem lies in the conduction of sound waves to the inner ear.
The cochlea is a coiled bony structure having 2¾ turns, situated medial to the ME. The part near the ME corresponds to the basal turn. The tip of the coil is directed forwards and inwards, towards the brain. From the centre of the cochlear coil, the modiolus (on the brain-facing side), emerges the cochlear nerve (CN), which carries impulses to the brain. When the sound pressure reaches the inner ear (cochlea) through the piston-like movement of the OW, it produces a positive pressure in the adjoining perilymph of the scala vestibuli compartment, and this pressure is transmitted to the corresponding membranous cochlea, the scala media (SM). The pressure inside the SM deflects the basilar membrane (BM) downwards towards the scala tympani (ST), the second compartment filled with perilymph. The scala tympani has a window towards the middle ear covered by a membranous structure, called the "round window" (RW) (Figure 5.3).[5]
When the positive pressure due to the deflection of BM reaches the ST, the RW bulges into the ME from inside, thereby releasing the pressure wave.
There are sensory nerve cells on the BM called “hair cells” (HC). These hair cells are specialised nerve cells that can generate electrical impulses by the movement of the hairs, stereocilium (short) and kinocilium (long).[ 6]
The BM acts as a string attached to two ends of the cochlea. The deflection of the BM downwards at the base of the cochlea as a result of the sound pressure entering through the OW travels in a wave from one end of the cochlea to the other end. Depending on the frequency of sound, the location of maximum deflection of the BM varies. Thus, high-frequency sounds produce maximum deflection at the basal turn and low-frequency sounds produce deflection at the tip. Mid-frequencies have a maximum deflection at the middle part of the BM.
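This place–frequency map of the basilar membrane is often approximated in auditory science by the Greenwood function (not part of this chapter's text). The sketch below uses the commonly quoted human constants, with x the fractional distance along the BM from apex (0) to base (1):

```python
def greenwood_hz(x):
    """Greenwood place-frequency approximation for the human cochlea.
    x: fractional distance along the basilar membrane, 0 = apex, 1 = base.
    Constants are the widely quoted human values (A=165.4, a=2.1, k=0.88)."""
    return 165.4 * (10 ** (2.1 * x) - 0.88)

print(round(greenwood_hz(0.0)))  # apex: ~20 Hz (low pitch)
print(round(greenwood_hz(0.5)))  # middle of the BM: mid frequencies
print(round(greenwood_hz(1.0)))  # base: ~20.7 kHz (high pitch)
```

The exponential form reflects why each octave of pitch occupies a roughly equal length of membrane, with high frequencies mapped to the basal turn and low frequencies to the apex, exactly as described above.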
The deflection of BM induces a bending of the cilia of the hair cells, which are fixed over the BM; this movement induces an action potential (AP) inside the hair cells. This AP travels through the first-order auditory neurons which are situated in the modiolus (axis) of the cochlea.[ 3]
These first-order nerve cells in the modiolus are called “spiral ganglion” (SG). Impulses from the SG travel to the brain stem, cochlear nucleus (CN), of both sides. From there, it travels as a nerve bundle (lateral lemniscus) to the inferior colliculus of the mid-brain and to the medial geniculate body of the thalamus and from there to the auditory neurons in the temporal lobe of the opposite cerebrum, the auditory cortex. Part of the impulses reaches the same side auditory cortex also. It is the auditory cortex that perceives the sound and understands the meaning of what is heard.[ 6]
As mentioned earlier, different frequencies of sound produce maximum deflections at different locations of the BM. The auditory cortex understands the pitch by noting the location of maximum deflection.
Similarly, depending on the severity of deflection, the loudness is perceived by the auditory cortex. Thus, identifying the location of maximum deflection helps in perceiving the tone or pitch of the sound, while the intensity of deflection helps in identifying the loudness. This explanation is called the "travelling wave theory" and was first given by von Békésy.
The basal turn hair cells perceive high-frequency sounds and the apical cells perceive sounds of lower frequencies. As the basal turn is situated nearest to the OW and all the sound pressure reaches here first, damage due to excess sound (noise) also occurs first at this location. So in noise-induced deafness, high-frequency hearing is lost early.[2]
The function of the hair cell is to convert sound energy to electrical impulses, and any interference with this mechanism either due to congenital absence or degeneration of hair cells results in sensory neural deafness (SND). The deafness occurring in old age, otherwise called “presbycusis”, also mostly affects the hair cells inside the cochlea and so results in SND. Apart from the genetic susceptibility, whatever noise one is exposed to in one’s lifetime determines the age of occurrence and severity of presbycusis.
Damage to the spiral ganglion cells or its proximal neural connections to the brain results in nerve deafness (ND). Some of the causes of ND are infections like meningitis, degenerative diseases like multiple sclerosis, tumours like acoustic neuroma (AN) and even some of the types of presbycusis.
Generally, most of the conductive deafness which happens due to impaired transmission of sound waves to the cochlea can be treated very successfully, either by medical or surgical measures. Examples are wax in the external ear canal, secretory otitis (fluid inside the middle ear), suppurative otitis media (infection in the middle ear), otosclerosis (stiffening of the stapes ossicle) and so forth.[ 6]
Sensory neural deafness can be present from birth or acquired (happening later in life). Common reasons for hair cell degeneration later in life are certain hereditary disorders, noise exposure (either sudden or chronic), old age, drug toxicity, bacterial or viral infections of the cochlea, Meniere's syndrome, lifestyle diseases like diabetes and so forth. Fortunately, most SNDs can be managed well either by hearing aids or by cochlear implantation. The choice between the two depends on the severity and the cause of the deafness. The vast majority of congenital deafness is due to improper development of hair cells and thus responds well to cochlear implantation. Normal hearing aids work by simply amplifying the sound intensity, providing a stronger stimulus to the remaining hair cells. Since in most cases some percentage of residual hair cells remains, most SNDs can be managed by providing a hearing aid. If the remaining hair cells are below a critical level, whatever stimulation we give will not be effective, and cochlear implantation is the treatment of choice.[3]
As we all know, speech is initiated in the larynx. This process of initiating voice is called "phonation". Even though the larynx is the voice producer, the voice needs further amplification, modulation and articulation to become intelligible speech.
The larynx is a tubular structure situated between the pharynx and the windpipe (trachea). The pharynx is the common pathway for respiration and food passage connecting the nose and oral cavity with the trachea and the oesophagus. It is through the trachea that air passes to the lungs during respiration.[ 7]
The larynx is formed by seven cartilages held together by ligaments (connective tissue) and muscles. It is not only the voice producer but also part of the airway between the pharynx and the trachea. It also has a sphincter mechanism that protects the lungs from secretions and prevents food particles from the throat from entering the lungs. The cartilages of the larynx are the thyroid (1), cricoid (1), epiglottis (1), arytenoids (2), and cuneiform and corniculate (2).
The muscles of the larynx are either intrinsic or extrinsic. Intrinsic muscles connect the above cartilages, whereas extrinsic muscles connect the laryngeal cartilages with adjacent structures such as the hyoid bone, sternum (chest bone) and so forth, and help stabilize the larynx during swallowing and phonation. The larynx's inner surface (lumen) is lined by a mucous membrane and has two folds on each side, lying front to back. The upper fold is called the "ventricular band" (false cord) and the lower fold, the vocal cord (VC), is the true cord. It is the true vocal cords which vibrate during phonation. The mucosal lining of the true vocal cords is different from that of the rest of the larynx: it consists of multi-layered cells without cilia, called "squamous epithelium". This is because the mucosa of the vocal folds is subjected to constant friction owing to the vibrations of the true vocal cords (Figure 5.4).[7]
The initiation of voice in the larynx is called "phonation". Phonation is a product of the expiratory airflow, modulated by the closure of the glottis (the space between the true vocal cords of the two sides) and the vibrations of the true VC. The vibration of the mucosa of the VC is an involuntary act owing to the pressure of the air flowing through the narrow glottis. To develop enough pressure below (in the sub-glottis), the VC have to come together with proper stiffness. It is the specialized mucosa of the vocal folds which vibrates, not the entire VC. The muscle of the VC which gives stiffness to the vocal cord is called the "vocalis"; it is part of the thyro-arytenoid muscle, which connects the thyroid cartilage with the arytenoid cartilage. The stiffness (tension) of the VC determines the pitch of the sound produced: the stiffer the VC, the higher the pitch. Similarly, the length and thickness of the VC also affect the pitch: the longer and thicker the VC, the lower the pitch. That is why male voices have a lower pitch than female voices. The force of the expiratory air decides the loudness.
Thus, for the vocal cords to vibrate, two important requirements are needed:
1. The vocal cords have to come together at the midline (adduction) with a certain level of stiffness or tension.
2. The expiratory air has to pass through the space between the vocal folds with force. For this, the VC are to be held tight by the action of muscles, mainly the lateral thyroarytenoid including the vocalis part and the cricothyroid muscles. The loudness of the sound depends mainly on the force of the expiratory air passing out from the lungs. Depending upon the mass, length and tightness of the vocal folds, the pitch of the sound varies.[ 8]
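A first-order way to see why longer, heavier, slacker folds give a lower pitch is the ideal vibrating-string formula f = (1/2L)·√(T/μ). The vocal folds are of course not ideal strings, and the parameter values below are purely illustrative, not physiological measurements:

```python
import math

def string_f0(length_m, tension_n, mass_per_len):
    """Fundamental frequency of an ideal string: f = (1/2L) * sqrt(T/mu)."""
    return math.sqrt(tension_n / mass_per_len) / (2 * length_m)

# Illustrative (hypothetical) numbers: a longer, heavier fold versus a
# shorter, lighter fold under the same tension.
longer_fold = string_f0(length_m=0.016, tension_n=0.5, mass_per_len=0.012)
shorter_fold = string_f0(length_m=0.010, tension_n=0.5, mass_per_len=0.006)

print(round(longer_fold), round(shorter_fold))  # the shorter, lighter fold is higher-pitched
```

The same formula also captures the tension effect described above: doubling T raises f by a factor of √2, which is why stiffer cords produce a higher pitch.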
The sound thus produced by phonation is not only weak but also monotonous. It is amplified to a great extent by the resonance (synchronized vibrations) of the pharynx, nose and oral cavity, including the cheeks and teeth. For intelligible speech, articulation and modulation of the laryngeal sounds are essential. The character of a particular person's voice depends on these factors too. As mentioned earlier, the pharyngeal muscles, the tongue, teeth, lips, soft palate and nasal structures all contribute to that person's articulation and the final character of the voice.[8]
Speech production is a highly complex process in which the cortical and sub-cortical brain, the cerebellum and their interconnections play a vital role. The real initiation of speech starts in the prefrontal area of the brain, which receives inputs from the auditory cortex (hearing area) and from the limbic system related to emotions. From the pre-frontal cortex, impulses pass to Broca's area of the cerebrum, which, by coordinating with the cerebellum, instructs the motor cortex and the respiratory centre in the brain stem, thereby producing the required movements of the muscles of the larynx, chest, pharynx, tongue and mouth.
Our voice is a mix of many frequencies. Each person has a characteristic voice and speech specific to him. The voice of a particular person has a specific predominant frequency called “fundamental frequency”. Generally, for males this fundamental frequency lies between 80 Hz (cycles/sec) and 160 Hz. For females, it is almost double, that is between 160 and 260 Hz. Most of the time, as mentioned earlier, the voice and speech of a person can be easily identified by another familiar person. The character of the voice of a particular person depends on the fundamental frequency and the overtones produced by the larynx, its modulations and resonance by the pharynx and oral cavity. Not only the anatomical differences but also the tone, the stiffness and the contractions of the laryngeal and related muscles play their part in determining that person’s speech characteristics.
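The fundamental frequency described above can be estimated from a recording. A common textbook approach (not taken from this chapter) is autocorrelation: a periodic waveform correlates strongly with itself when shifted by one period. Here it is sketched on a synthetic, harmonic-rich 120 Hz "male" waveform:

```python
import math

SR = 8000  # sample rate, Hz
F0 = 120   # true fundamental of the synthetic voice-like signal, Hz

# Build a harmonic-rich signal, crudely imitating a voiced sound.
N = 2048
sig = [math.sin(2 * math.pi * F0 * t / SR)
       + 0.5 * math.sin(2 * math.pi * 2 * F0 * t / SR)
       + 0.25 * math.sin(2 * math.pi * 3 * F0 * t / SR)
       for t in range(N)]

def autocorr(signal, lag):
    """Unnormalized autocorrelation of the signal at a given lag."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))

# Search lags corresponding to 60-300 Hz and pick the strongest match.
lags = range(SR // 300, SR // 60 + 1)
best_lag = max(lags, key=lambda lag: autocorr(sig, lag))
estimated_f0 = SR / best_lag
print(round(estimated_f0))  # close to the true 120 Hz
```

On real speech the same idea works, though noise, jitter and the formant structure of the vocal tract make robust pitch tracking considerably harder than this sketch suggests.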
The faculties of speech and hearing are closely interrelated and are fundamental to every intellectual activity of a person. This chapter explains the complex anatomy and working mechanism of the auditory system, which includes the ear (external, middle and inner) and the auditory cortex in the brain with its interconnections. Voice production is a very complex process, initiated first in the brain, with phonation in the larynx, modified by resonance and articulation by the throat, oral cavity, tongue, cheek and nasal cavity.
• The perception of different frequencies and of loudness is by the location and degree of maximum deflection of the BM. This phenomenon is well explained by the travelling wave theory of von Békésy.
• Depending upon the structures affected, hearing loss (HL) can be conductive, sensorineural or neural.
• Most deafness is preventable, especially that caused by middle ear infections, noise and so forth. Even old-age deafness (presbycusis) can be delayed by avoiding exposure to loud noise.
• Both conductive and sensorineural deafness can be satisfactorily managed either by medical or surgical methods.
• Some cases of SND or neural deafness require hearing aids.
• The pitch of the voice depends upon the length and thickness of the VC.
• The loudness of the voice is decided by the air pressure developed in the subglottic area, just below the VC. This also depends upon the respiratory capacity of the individual.
• If we misuse or overuse the larynx by continuous speech, or assault it with irritants such as smoking or alcohol, the VC can fail early, developing inflammation, nodules or even atrophy, resulting in temporary or permanent voice changes. If the VC are subjected to continuous irritation, it can even lead to cancerous changes in the larynx.[9]
Sound is a form of energy, just like heat, electricity or light. For both animals and humans, the sense of hearing helps them experience the world around them through sound. Some sounds are pleasurable, and some are annoying. All of us are subjected to different types of sound all the time. Sound waves are created by the vibration (to and fro motion) of objects, and they travel through a medium (e.g., air, water). Among the many sounds around us, the ringing of a bell is an example of sound caused by vibration; one can feel the vibration by putting a finger on the bell after striking it. Some sounds come with visible vibrations causing them, while some do not. If a rubber band is pulled taut and released, it moves to and fro about its central axis, producing a sound that can be heard while the vibration is seen. Sounds below 20 Hz and above 20 kHz are not audible to humans.
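The idea that a sound is simply a to-and-fro vibration at some rate can be made concrete in a few lines of Python. This is a hypothetical sketch: the pure tone is represented as a list of samples, and the audibility check uses the 20 Hz–20 kHz limits quoted above:

```python
import math

def sine_wave(freq_hz, duration_s=0.01, sample_rate=44100):
    """Samples of a pure tone: a single vibrating frequency."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

def audible(freq_hz):
    """Human audible range quoted in the text: 20 Hz to 20 kHz."""
    return 20 <= freq_hz <= 20_000

print(audible(440))     # True  - concert pitch A
print(audible(5))       # False - infrasonic (e.g. volcanic sound)
print(audible(40_000))  # False - ultrasonic
```

Feeding such sample lists to a sound card at the chosen sample rate is, in essence, how all digital audio playback works.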
While all of us hear and experience sounds, many of us are not aware of how sound is measured and how the technique of capturing sound changes depending on the origin and nature of the sound. In this chapter the reader will learn the following.
• The different types of sound present in the universe
• Measurement of sound
• History of the microphone
• Microphone and its classification
• Unit of measurement
There are different sounds present in the universe, a few examples of which are shown in Figure 6.1. These sounds come from different sources, have different intensities and demonstrate different characteristics in time and frequency. They can be categorised into four types. (1) Natural terrestrial sounds: The sounds of thunder, rain and wind are natural sounds with frequencies ranging from 5 Hz to 250 Hz. Volcanic sounds are infrasonic, that is, below 20 Hz. (2) Natural extraterrestrial sounds: These emanate from satellites, comets, the sun, planets and stars, with frequencies ranging from 40 MHz to 40 GHz. (3) Human and animal sounds: These include human speech and sounds from different human organs such as the heart, lungs and intestine. Animal sounds refer to the sounds of different animals, both domestic and wild. The frequencies of these sounds range from 20 Hz to 20 kHz. (4) Human-made sounds: These include noise and sounds made by humans involved in various activities, such as the sound of traffic, power transmission lines and musical instruments. Most of these sounds have frequencies in the audible range.
This section explains the history of the microphone based on the patents at each stage. Microphones are used to record speech signals, internal sounds of humans, noises, sounds of birds and animals, and so on.
Figure 6.2 shows various microphones along with their years of invention. Work towards the very first microphone began in the early 1800s. The French physician and physicist Felix Savart created a sound level meter in 1830 which measured the noise level in decibels.1 In the late 1800s many scientists, such as Alexander Graham Bell, Elisha Gray and Ettore Majorana, worked in parallel on the microphone. The first microphone was invented by Emile Berliner and Thomas Edison in 1876, while David E. Hughes independently created the same type of carbon microphone in 1878.2 The first carbon microphone was patented by Alexander Graham Bell in 1878. It consisted of a wire which conducted direct electrical current (DC); a moving armature transmitter and receiver generated and received audio signals, respectively, and transmission was possible in either direction. A liquid transmitter was part of the second microphone invented by Bell in 1876. Later, Emile Berliner patented a design based on Bell's liquid transmitter; it consisted of a steel ball set against a stretched metal diaphragm. Francis Blake developed a microphone using a platinum bead.3 The designs of all three microphones are shown in Figure 6.3.
In 1917 the piezoelectric microphone (crystal microphone) and the hydrophone were invented by Paul Langevin2 and R. N. Ryan,4 respectively. In 1920 the earliest electret microphone was invented by the Japanese scientist Yoguchi.2 However, the practical electret microphone dates to the 1960s, when a patent was awarded to Gerhard Sessler in 1962.2 The shotgun microphone was invented by Harry F. Olson in 1941.2 Later on, wireless technology became a trend and wireless microphones were invented in 1957 by Raymond A. Litke.2 Then the USB and Lavalier microphones came into the market. In 1983 the MEMS microphone was invented by D. Hohm and Gerhard M. Sessler,2 which became popular in the market due to its special features such as smaller size, low cost and high quality of sound. The fiber optic microphone was invented in 1984 by Alexander Paritsky.5
This section explains the microphone and its classification, working principle, advantages and disadvantages with the help of figures.
The word ‘micro’ means small and ‘phone’ means sound. Thus, the word ‘microphone’ literally means ‘small sound’: a device for picking up small (faint) sounds. A microphone contains a transducer that converts sound waves (acoustic energy) into an electrical signal.
Microphones can be divided into three types based on principles, polar patterns and applications. Figure 6.4 illustrates the classification of microphones.
Microphones are divided into three types on the basis of the working principle – dynamic, condenser and ribbon.
A. Dynamic microphone: This microphone consists of a diaphragm, a coil and a magnet. It works on the principle of electromagnetic induction. When a sound wave strikes the diaphragm, it moves back and forth, and the coil, which is attached to the diaphragm, moves with it. As the coil moves within the magnetic field of the permanent magnet surrounding it, an electric current is induced in the coil. Figure 6.5(D) depicts a dynamic microphone. It is one of the most commonly used microphones for recording. It typically has a cardioid polar pattern and rejects noise from its rear side.7
Advantages: A dynamic microphone is robust. It has the capacity to pick up high sound pressure levels and provides good sound quality. It is inexpensive and does not need power to operate.
Disadvantage: A dynamic microphone has poor high-frequency response due to the inertia of the coil, tube and diaphragm and the force required to move the coil through the magnetic field. Hence, compared to a condenser microphone, it is less suitable for recording musical instruments (e.g., guitar, violin) rich in high frequencies and harmonics.7
B. Condenser microphone: This microphone consists of two charged metal plates: a stationary backplate and a movable diaphragm, together forming a capacitor. It works on the electrostatic principle. When a sound signal strikes the diaphragm, the distance between the two plates changes, which changes the capacitance. This change in spacing between the moving diaphragm and the stationary backplate creates an electrical signal. Figure 6.5(E) shows an example of a condenser microphone. Condenser microphones require electric power to operate, provided either by a battery or through the microphone cable (phantom powering). They can operate with phantom power voltages ranging from 11 volts to 52 volts.7
Advantages: A condenser microphone is smaller than a dynamic microphone and has a flat frequency response. It supports a high range of frequencies due to a fast-moving diaphragm.
Disadvantages: A condenser microphone is costly, more sensitive to temperature and humidity, and there is a limitation to the maximum signal level the electronics can handle.7
Condenser microphones are divided into two on the basis of the size of the diaphragm – large (1 inch and above) and small (less than 1 inch).8
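The electrostatic principle behind the condenser capsule can be sketched numerically: for parallel plates, C = ε₀A/d, and with a fixed charge Q on the plates (as in a DC-biased or electret capsule) the output voltage V = Q/C changes as the diaphragm moves. The dimensions and charge below are purely illustrative, not taken from any real capsule:

```python
EPS0 = 8.854e-12  # permittivity of free space (F/m)

def capacitance(area_m2, gap_m):
    """Parallel-plate capacitance C = eps0 * A / d (air dielectric)."""
    return EPS0 * area_m2 / gap_m

def voltage_change(charge, area_m2, gap0, gap1):
    """With fixed charge Q on the plates, V = Q / C, so a change in
    plate spacing from gap0 to gap1 changes the output voltage."""
    return charge / capacitance(area_m2, gap1) - charge / capacitance(area_m2, gap0)
```

Halving the gap doubles the capacitance, and widening the gap raises the voltage for a fixed charge; it is this small voltage swing that the capsule's preamplifier picks up.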
C. Ribbon microphone: This microphone works on the principle of electromagnetic induction. It has a thin, electrically conductive, ribbon-like diaphragm suspended within a magnetic structure. When sound waves strike the diaphragm, it moves back and forth within the permanent magnetic field inducing a signal electromagnetically across it. This microphone is most commonly used in radio stations. It picks up the velocity in the air and not just air displacement. It provides better sensitivity to higher frequencies and captures highly dynamic and high fidelity sounds.9 A ribbon microphone is shown in Figure 6.5(F).
Advantages: Because of their sonic characteristics, ribbon microphones are favoured for various applications. These special sonic characteristics provide an advantage over other microphones for smooth, warm and natural sound.9
Disadvantages: Ribbon microphones are bulky because they require huge magnets. For example, magnets in a ribbon microphone such as the classic RCA 44 weigh up to six pounds. These microphones are fragile.9
Microphones are also classified according to their sound pick-up pattern (polar pattern), and every microphone, irrespective of its working principle, has a polar pattern. The main polar patterns are summarised below.
A. Omnidirectional: The word ‘omni’ means uniformity in all directions. Thus, a microphone with an omnidirectional polar pattern picks up sound equally from all directions. Such a polar pattern is shown in Figure 6.6(A). Omnidirectional microphones are particularly useful for recording room ambience and capturing group vocals.11
B. Cardioid/Unidirectional: This polar pattern is heart-shaped, hence the name cardioid. It picks up sound from the front and offers maximum rejection at the rear, as shown in Figure 6.6(B). It has a null point at the rear (180°) and a 6 dB decline at its sides (90° and 270°) compared to the on-axis (0°) response. It is suitable for live performances and situations where feedback suppression is required. It is most commonly used in studios.11
C. Hypercardioid: A polar pattern in such a case has a narrow pickup pattern and also picks up sound from the rear which is shown in Figure 6.6(C). It has null points at 110° and 250° and roughly a 12 dB decline in sensitivity at its sides, that is, 90° and 270°, and a rear lobe of sensitivity with 6 dB less sensitivity, that is, at 180°compared to the on-axis (0°). It is most commonly used in loud-sound scenarios. It has better isolation, and its feedback resistance is higher than the cardioid’s.11
D. Supercardioid: The polar pattern in this case is similar to that of a hypercardioid microphone but with a compressed rear pickup. It has null points at 233° and 127° and roughly a 10 dB decline in sensitivity at its sides, that is, 90° and 270° and a rear lobe of sensitivity which is 10 dB less, that is, at 180° compared to the on-axis (0°).11 Such a polar pattern is shown in the Figure 6.6(D).
E. Figure-8 or Bi-directional: The name itself describes the characteristics that it picks up sound from two directions, that is, front (0°) and behind (180°) the microphone forming a shape of the digit 8 and, hence, the name Figure-8. The polar pattern is shown in Figure 6.6(E). It has null points at its sides, that is, at 90° and 270°.11
G. Lobar/Shotgun: This is an extended version of the supercardioid and hypercardioid polar patterns. It has a tighter polar pattern up front with a longer pickup range, as shown in Figure 6.6(G). It is more directional than the hypercardioid; hence, shotgun microphones are used in filmmaking and theatre. They also make great overhead mics for capturing sounds like singing groups, choirs and drum cymbals.11
H. Boundary/PZM Hemispherical: This hemispherical polar pattern is produced by a boundary or pressure zone microphone (PZM) and is shown in Figure 6.6(H). The pattern is obtained by mounting the microphone capsule flush on a surface within an acoustic space, placing the microphone itself on the boundary.11
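The dB figures quoted above for the cardioid and hypercardioid patterns follow from the standard first-order polar equation p(θ) = a + (1 − a)·cos θ, where a = 0.5 gives a cardioid, a = 0.25 a hypercardioid and a = 0 a figure-8. A sketch of this idealised model (not the measured response of any real microphone):

```python
import math

def pattern_db(a, theta_deg):
    """First-order polar pattern p(theta) = a + (1 - a) * cos(theta),
    expressed in dB relative to the on-axis (0 degree) response.
    a = 0.5 gives a cardioid, a = 0.25 a hypercardioid,
    a = 0 a figure-8 (all idealised)."""
    p = a + (1 - a) * math.cos(math.radians(theta_deg))
    return 20 * math.log10(abs(p))  # undefined at exact nulls
```

With a = 0.5 this gives about −6 dB at 90°, and with a = 0.25 about −12 dB at 90° and −6 dB at 180°, consistent with the figures quoted for the cardioid and hypercardioid patterns.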
Depending on the applications, there are different types of microphones which are described below.
A. Liquid microphone: Alexander Graham Bell and Thomas Watson invented the liquid microphone. It was the first of the working microphones to be developed. It consists of water and sulphuric acid in a metal cup. The cup is placed below a diaphragm, with a needle attached to the receiving diaphragm dipping into the liquid. When sound waves strike the diaphragm, the needle vibrates in the liquid, and the small electrical current passing through the needle is modulated by the sound vibrations. The liquid microphone was never a particularly functional or efficient device, but it helped in the development of other microphones.12
B. Carbon microphone: The carbon microphone is the oldest kind of microphone. It has a thin metal or plastic diaphragm on one side and uses carbon dust. When a sound wave hits the diaphragm, the carbon dust is compressed, which alters its resistance and thus modulates the current flowing through it. This technology was used in telephones as well. It is still used in the chemical manufacturing and mining industries, where the higher line voltages required by other microphones might cause explosions.12
C. Fiber optic microphone: It uses super-thin strands of glass rather than metallic wires to transfer the audio signals. It converts acoustic signals into light signals. Since there is no generation of electrical signals in the microphone or optic fiber cables, the fiber optic microphone gives secure and interference-free communication in electrically or chemically hazardous conditions.12
D. Electret microphone: This is a condenser microphone whose capacitor is permanently charged. An electret is any dielectric material that maintains its electric polarisation after being subjected to a strong electric field. The microphone consists of a light, moving diaphragm, a stationary backplate and electret material; the electret produces a constant electric field, supplying the polarising voltage without an external power source. When sound waves move the diaphragm, the capacitance between the diaphragm and the backplate changes, inducing an AC voltage on the backplate.12
E. Laser microphone: In this type of microphone, a laser beam directed into a room through a gap of a window, reflects off the objects; the reflected beam is converted into an acoustic/audio signal by a receiver. As vibrations shift the surface of the vibrating object, the reflection of the laser is deflected. The receiver will find the laser deflections due to the vibrations that were originally created from an audio signal. Therefore, a receiver takes in the oscillating laser signal from a constant/fixed location. The receiver can then filter and amplify this beam signal and produce the audio output. Through this process the laser microphone successfully reproduces the audio that causes the object’s vibrations.12
F. Crystal microphone: This microphone works on the piezoelectric effect. It produces the voltage when crystals with piezoelectric effect are deformed. The diaphragm is attached to a thin strip of piezoelectric material. When the crystal is deflected by the diaphragm, the two sides of the crystal gain opposite charges. The charges are proportional to the amount of deformation and disappear when the stress on the crystal disappears. Because of its high output, Rochelle salt was used in early crystal microphones but it is sensitive to moisture and fragile. Later microphones used ceramic materials such as titanate, lead zirconate and barium. The electric output of crystal microphones is comparatively large but they are not considered seriously in the music market because the frequency response is not comparable to a good dynamic microphone.12
G. Wireless microphone: Nowadays wireless microphones are a necessity. Both professionals and non-professionals use wireless microphones for various activities like live performances, concerts, classroom presentations and so on. Wireless microphones don’t require wires or cables to connect the microphone to the sound system; they transmit sound via wireless channels and are capable of delivering high-quality sound. Based on its purpose, a wireless microphone system is divided into two types – professional and consumer. Professional systems provide high-quality audio and are used in broadcast systems, live performances and so on. On the other hand, consumer systems are used in headphones, toys and so on.
H. MEMS microphone: Microelectromechanical systems (MEMS) is an emerging technology in all fields of engineering such as automobiles, aerospace technology, biomedical applications, inkjet printers, and wireless and optical communications. The technology integrates the electronics and the mechanical elements of the transducer on a single chip to make a miniaturised structure at low cost. Component sizes range from a few micrometres to a millimetre. Due to its widespread application, this microphone has an increasing demand in the market. The materials used include ceramics, semiconductors, plastics, magnetic and ferroelectric materials, and biomaterials. A typical MEMS microphone is shown in Figure 6.6(H). The MEMS microphone is built on a printed circuit board (PCB) with MEMS components such as semiconductors, microactuators, microsensors and so on. In the final stage, it is protected by a mechanical cover. It uses capacitive technology, that is, the MEMS diaphragm forms a capacitor and sound wave pressure causes the diaphragm to move. It includes an audio preamplifier which converts the changing capacitance of the MEMS element to an electrical signal.12
I. USB microphone: A Universal Serial Bus (USB) microphone consists of a transducer which converts sound signal into analog audio signals. It has USB output implying that the output is digital, that is, a digital audio signal. The USB microphone consists of an analog-to-digital converter to convert the analog signals from its transducer element into digital signals. It has a built-in digital audio interface which connects directly to a PC or computer via USB connection/cable and hence the name USB microphone.13
J. Shotgun microphone: This microphone is also known as a rifle mic because like a shotgun, it points directly at its target source (person, instruments) in order to pick up the sound effectively. It works on the principle of ‘waveform interference’. The slots in the tube result in the interference and hence it is also called an ‘Interference Tube’. This microphone is unidirectional; it picks up the sound from the target direction with high gain when the shotgun microphone is in front of the subject/instrument or any other source.14
K. Hydrophone: In 1929 Reginald Fessenden invented the hydrophone, earlier known as the Fessenden oscillator. It is a device that resembles a microphone in the way it works. It converts underwater sound waves into electrical signals and is mainly used for detecting underwater acoustic waves such as those created by submarines. Most hydrophones are based on the piezoelectric effect. For better detection, an array of hydrophones is used instead of a single hydrophone. Basically, there are two types of hydrophones: (1) the omnidirectional hydrophone, which detects sound in all directions with equal sensitivity, and (2) the directional hydrophone, which has higher sensitivity in a particular direction and detects directional acoustic signals.15 For example, a hydrophone was placed as part of a deep ocean instrument package at a depth of more than 10,971 metres (6.71 miles). It continuously recorded the ambient sound levels of the deep ocean with frequencies ranging from 10 Hz to 32,000 Hz over 23 days.16
L. Eigen mic/mike: One of the most common Eigen mics is the em32 Eigen mike, a microphone array comprising many professional quality microphones placed on a rigid surface of a sphere. It has a two-step process. (1) Using digital signal processing the outputs of the separate microphones are fused to create a set of Eigenbeams. The sound field is captured by this complete set of Eigenbeams. The capture is limited by the spatial order of the beamformer. (2) The Eigenbeams are fused to steer numerous concurrent beam patterns that can be focused in particular directions in the acoustic field. Eigen mikes are used in real-time applications, multichannel surround sound recording and playback, spatially realistic teleconferencing and sound field spatial analysis. Other applications include sound production for music/film/broadcast, consumer products and gaming, security and surveillance, news or sports reporting.17
M. Sound level meter /Dosimeter: A sound level meter (SLM), also known as a sound pressure level (SPL) meter, noise dosimeter, noise meter or decibel (dB) meter, is used to measure the noise/sound levels by measuring sound pressure. Acoustic measurement values are shown on the display of the sound level meter. An SLM is mainly used to measure and manage noise from a variety of sources, such as industrial/factory, rail and road traffic, building construction work and so on.18
N. Lavalier microphone: It is a type of microphone used in television interviews, conferences, public speaking and so on which provides hands-free operation and offers uniform and clear sound without background noise. It is a wearable mic and usually people attach it to their tie, collar or lapel. It can be either wired or wireless. It is also referred to as a lapel mic, lav mic, collar mic, body mic, clip mic, personal mic and neck mic.19
A microphone array is a device consisting of two or more microphones integrated over a single circuit board that works like a normal microphone. It is used in the applications of acoustic signal processing technologies such as in Automated Speech Recognition, beamforming, speech signal separation, noise reduction and so on.20 The microphone may be a condenser, dynamic or ribbon.
There are three basic geometries in the microphone array: linear, planar and three dimensional (3-D). To create a microphone array, the frequency range and x-y-z coordinates of each microphone are required.21 Depending on the target application, the microphones in a microphone array are arranged in a linear, circular or triangular manner, as illustrated in Figure 6.7.22 There are a variety of other arrangements of microphone as well specific to different applications. Microphones in a microphone array record sounds simultaneously. However, it is important that the characteristics of all the microphones in the array are matched. The following three aspects are considered while selecting microphones in an array.
1. Directionality: The directionality of a microphone is the direction from which it can pick up sounds. All the microphones must have the same directionality when building a microphone array. A single microphone that picks up sound from a particular direction while the others pick up sound from all directions causes an imbalance and leads to poor sound recording.
2. Sensitivity: Sensitivity is the output level a microphone produces for a given input sound level. It must be matched across all microphones in an array, otherwise one microphone will be louder than the others, creating an imbalance in sound recording. Typically, a sensitivity difference of only ±1.5 dB to 3 dB is allowed in microphone arrays.
3. Phase: Phase refers to the relative timing with which each microphone in an array captures the signal. If the microphones have different phases, signals recorded at different times result in unsynchronised and undesired recordings.20 Hence, it is recommended that all microphones in an array be phase matched.
If the microphones in an array are dissimilar, problems such as variation of gain, phase and sensitivity will occur, leading to poor quality in audio applications. Hence, an array must use similar or closely matched microphones that meet certain specifications to avoid uneven sound recordings.20
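The sensitivity-matching requirement above can be expressed as a simple screening check. A minimal sketch (the function name and the ±1.5 dB default tolerance follow the figure quoted earlier; both are illustrative):

```python
def sensitivities_matched(sens_db, tol_db=1.5):
    """Check that every microphone's rated sensitivity (in dB) lies
    within +/- tol_db of the mean sensitivity of the array."""
    mean = sum(sens_db) / len(sens_db)
    return all(abs(s - mean) <= tol_db for s in sens_db)
```

The same pattern could be applied to any matched specification (gain, phase delay) before microphones are accepted into an array.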
1. Noise detection and measurement techniques: Noise can be detected and measured using beamforming techniques. A microphone phased array is used to measure the far-field noise, that is, in the open-jet and hybrid test sections. Spectra are determined by applying the source power integration technique to conventional beamforming maps. An experiment was carried out at the University of Twente in a closed test section, where wall-mounted microphones were used to reduce self-noise and improve the signal-to-noise ratio.23
2. Audio enhancement: While capturing audio, nearby noise can often ruin a recording. The recording must be digitally altered to enhance the sound, remove noise and save only the desired audio. In order to distinguish which parts of a recording are from the desired source, it is often necessary to take multiple recordings from different locations and compare the results. The source of the desired audio and each source of noise can be identified later so that only the unwanted sound is removed.24
3. Spatial audio object capture: This refers to the collection and preservation of the spatial information of an acoustic scene from multichannel or stereophonic sound recording and reproduction. Headphones or multiple loudspeakers are employed to permit the listener to distinguish the direction of each sound source, preserving the original sound scene. Multichannel audio provides a more realistic sensation and experience in gaming. Furthermore, teleconferencing applications use spatial audio to create immersive and natural communication between two or more subjects.25
4. Enhancing the recording quality: The differences in time lag between recordings can also be used to differentiate different sounds being heard by the microphone array. Each microphone picks up the sources of sound with varying delays and volumes. By comparing the differences in sound content among the microphone recordings, specific sounds can be isolated and amplified or removed. Unwanted sounds can be strategically removed, almost entirely, and the enhanced audio to be processed can be more clearly and accurately analysed.24
5. Separation of multiple speakers talking: Identifying and enhancing non-stationary speech targets in various noise environments, such as a conference, classroom meeting, cocktail party and so on, is a significant issue for real-time separation of speech from multiple speakers. By using microphone arrays and beamforming technique, separation of multiple speakers can be achieved.26
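The beamforming used in several of the applications above can be sketched in its simplest form, the delay-and-sum beamformer: each channel is delayed so that a wavefront arriving from the chosen direction lines up across all microphones, then the channels are averaged. This is a minimal illustration assuming a far-field source, a linear array and integer-sample delays, not a production implementation:

```python
import numpy as np

def delay_and_sum(signals, fs, mic_x, angle_deg, c=343.0):
    """Steer a linear microphone array toward angle_deg (measured from
    broadside) by delaying each channel so the wavefront aligns, then
    averaging the channels.

    signals -- 2-D array, one row per microphone channel
    fs      -- sampling rate in Hz
    mic_x   -- microphone positions along the array axis (metres)
    """
    angle = np.deg2rad(angle_deg)
    out = np.zeros(signals.shape[1])
    for sig, x in zip(signals, mic_x):
        # arrival-time difference relative to the array origin,
        # rounded to the nearest whole sample
        delay = int(round(x * np.sin(angle) * fs / c))
        out += np.roll(sig, -delay)
    return out / len(signals)
```

Signals from the steered direction add coherently while sounds from other directions partially cancel, which is the basis of the noise-measurement and speaker-separation applications described above.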
A pop filter is used to eliminate certain kinds of noise during recording. Plosives (/p/, /t/, /k/, /b/, /d/, /g/) produce sudden bursts of air pressure that overload the microphone; a pop filter reduces this overloading. It also acts as a protective shield against the vocalist’s saliva; since saliva is corrosive, a pop filter gives the microphone a longer life. Pop filters, as shown in Figure 6.8, are attached and placed in front of the microphone to eliminate noise due to fast-moving air.27
Sound is measured in terms of its intensity/pressure level. The unit of measurement is the decibel, denoted by dB. Here ‘deci’ means one-tenth and ‘B’ denotes the bel (named after Alexander Graham Bell): a decibel is one-tenth of a bel, and the scale uses base-ten logarithms. The applications of dB are widespread in scientific and engineering areas, especially within electronics, acoustics and control theory.28 Devices such as a decibel meter (as shown in Figure 6.9) or an audiometer are used to measure the level or intensity of sound. Table 6.1 presents examples of sound intensity.
| Sound intensity level β (dB) | Intensity I (W/m²) | Example |
|---|---|---|
| 0 | 1 × 10⁻¹² | Threshold of hearing at 1000 Hz |
| 10 | 1 × 10⁻¹¹ | Rustling of leaves |
| 50 | 1 × 10⁻⁷ | Soft music |
| 60 | 1 × 10⁻⁶ | Normal talk / conversation |
| 70 | 1 × 10⁻⁵ | Busy traffic, noisy area |
| 80 | 1 × 10⁻⁴ | Loud noise |
| 100 | 1 × 10⁻² | Factory siren at 30 m |
| 140 | 1 × 10² | Jet aeroplane at 30 m |
| 160 | 1 × 10⁴ | Bursting of eardrums |
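The table values follow directly from the definition of sound intensity level, β = 10 log₁₀(I/I₀), with the reference intensity I₀ = 10⁻¹² W/m² (the threshold of hearing). A quick sketch in Python:

```python
import math

I0 = 1e-12  # reference intensity: threshold of hearing (W/m^2)

def intensity_to_db(intensity):
    """Sound intensity level beta = 10 * log10(I / I0), in decibels."""
    return 10 * math.log10(intensity / I0)
```

For example, `intensity_to_db(1e-6)` gives 60 dB (normal conversation) and `intensity_to_db(1e2)` gives 140 dB (jet aeroplane at 30 m), matching Table 6.1.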
The need for measuring sounds with various intensities and time frequencies led to the development of a variety of microphones. The medium of recording, size of the device, sound quality and target applications also influence the variety of microphones. An array of such measuring devices is also used due to its capacity to remove unwanted sounds when recording in open spaces.
Depending on the specifications, various sound measuring devices are available in the market at different prices. The graph in Figure 6.10 illustrates the different types of sound measuring devices with their cost. The cost of the devices is taken from various online marketing/shopping websites and may vary according to location/offers/availability. It is clear that a stethoscope, used for listening to internal body sounds like the heart, lungs and so on for clinical purposes, is relatively cheap. However, devices like Eigen microphones that capture spatial sounds are the most expensive among the ones shown in Figure 6.10. It is also worth noting that the price of every device depends on its manufacturer and its applications. Thus, a user needs to decide which measuring device he/she would like to purchase depending on all these factors.
While the choice of the right device plays an important role in measuring sound, it is equally important to pay attention to how the recorded sound samples are stored for further use, analysis and processing. For example, magnetic tape or a hard disk drive (HDD) in a computer can be used to store sound. While magnetic tape was used for recorded sound over the last few decades, hard disk drives and, increasingly, solid-state drives (SSDs) are more widely used at present. These come in various forms, namely pendrives, SD cards, SATA drives and so on. Data from such a storage medium needs to be retrieved for further processing of recorded sounds. Thus, storage and retrieval of data is an integral part of sound measurement. A large number of engineering solutions exist for such data transfer and storage.30
Devices for recording various sounds have seen drastic changes in design, quality, characteristics and price. In 1827, the word ‘microphone’ was coined by Sir Charles Wheatstone. In the 19th century interest in and development of the microphone increased, and in 1916 the condenser microphone was patented by E. C. Wente. Radio broadcast became the foremost source of entertainment for everyone. Hence, scientists started working more on the microphone and developed the ribbon microphone, which suited radio broadcasting. In the 21st century researchers have worked more on MEMS, Eigen mike and array microphones. In 2010 the Eigen mike came to the market and played a vital role in capturing sound from different directions. Microphone arrays have been ruling the audio industry because of the dynamic surround sound recording that they allow. MEMS microphones focus mainly on miniature and portable devices including headsets, cell phones and laptops. There has been a recent demand for sound measuring devices for smart wearables, smart homes/buildings and automobile technology, which may also lead researchers to develop various microphones in the coming days.
This chapter intends to describe the principles involved in quantifying the effects of noise and music on health and well-being. The main principle of quantification is defining and operationalizing the variables. The independent variables (variables that cause the impact) may be classified as those related to characteristics of noise and music. The other set of dependent variables are measures of impact on health and well-being. We will create a comprehensive and unified template to guide practitioners and researchers to quantify the impact of noise and music on health and well-being. The variables can be categorized as belonging to physical and affective domains. Physical attributes can be directly measured, whereas affective attributes are evaluated using surrogate measures. Hearing threshold is a physical variable that can be measured using auditory brainstem response. Annoyance due to loud noise is an affective variable that is measured indirectly by estimating performance levels. The list of variables associated with noise, music and health is enumerated. The method to standardize and operationalize the variables will be explained. The concept of bias and strategies to limit the errors are described. The steps of developing a validated measurement tool are also explained. The last section deals with the application of artificial intelligence in quantifying the impact of noise and music on health and well-being.
1. Method to operationalize dependent and independent variables in the physical and affective domains relevant to noise, music and health
2. Describe bias and strategies to limit errors due to bias
3. Steps to develop a psychometrically validated measurement tool
4. Application of machine learning and artificial intelligence in this context
Measurement is a precise activity. The variables that need to be measured can be of two types: those in the physical domain and those in the affective domain. Tangible variables are called physical variables. For example, blood pressure has an absolute zero, so it is a variable in the physical domain. Motivation is a variable that is measured indirectly based on behavioural output. There are scales to measure motivation, but there is no absolute zero. This important principle has to be applied while operationalizing variables to measure the impact of noise and music on health. The main steps to operationalize a variable are defining the variable, deciding on the measurement tool, identifying sources of bias, planning strategies to limit them, and finally summarizing using descriptive and inferential statistics. The last two steps, namely descriptive and inferential statistics, have been explained in Chapter 22 on the framework to conduct research on sound and health.1 The reader may refer to Chapter 22 for further details. All the variables commonly described in the literature with relevance to noise, music and health will be defined and the methods to operationalize them are described in the following sections.
Loudness of sound (independent variable): The intensity of sound or music is called the loudness of sound. Decibel is the physical unit of measurement. The reader can refer to Chapter 6 on measurement of sound for more details on all aspects of sound measurement. Phon is the perceptual equivalent for loudness.
Dosage of sound (independent variable): The total sound energy that a person is exposed to is defined as the dosage. L equivalent (Leq) is the physical unit to measure this variable. An integrating sound level meter is used to measure the dose of exposure.
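Leq is an energy average, not an arithmetic average of decibel readings: the dB values are converted back to relative energies, averaged, and converted back to dB. Assuming a series of SPL samples taken at equal time intervals, a minimal sketch:

```python
import math

def leq(levels_db):
    """Equivalent continuous sound level from short-term SPL readings
    taken at equal time intervals:
    Leq = 10 * log10( mean( 10^(L_i / 10) ) )."""
    mean_energy = sum(10 ** (L / 10) for L in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)
```

Note that `leq([50, 70])` is about 67 dB, much closer to the louder reading than the arithmetic mean of 60 dB, because the louder interval dominates the total sound energy.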
Frequency characteristics of sound (independent variable): The frequency characteristics of sound or music are measured in hertz (number of cycles per second). Pitch is the perceptual equivalent of frequency. Pure tones are sounds with a single frequency. They are easy to quantify. Complex sounds and music have a spectrum of amplitudes and frequencies which vary with time. Quantifying them is a complex task. A process called the fast Fourier transform (FFT) is used to quantify these variables. The reader may refer to Chapter 3 on the physics of sound to understand FFT. Timbre is the perceptual measure of these complex sounds.
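The FFT mentioned above can be illustrated with NumPy on a synthetic pure tone (a hypothetical 440 Hz signal; real recordings would additionally need windowing and spectrum averaging):

```python
import numpy as np

fs = 8000                            # sampling rate (Hz)
t = np.arange(fs) / fs               # one second of sample times
tone = np.sin(2 * np.pi * 440 * t)   # a 440 Hz pure tone

spectrum = np.abs(np.fft.rfft(tone))          # magnitude spectrum
freqs = np.fft.rfftfreq(len(tone), d=1 / fs)  # frequency of each bin
dominant = freqs[np.argmax(spectrum)]         # -> 440.0
```

For a pure tone the spectrum has a single peak at the tone's frequency; a complex sound or a musical note would show a fundamental plus harmonics, which is what distinguishes timbre.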
Quality of sleep (dependent variable): The quality of sleep is measured based on sleep latency (time to fall asleep after going to bed, usually 30 minutes), sleep waking (frequency of waking up during sleep), wakefulness (duration of time spent awake after first falling asleep, usually limited to 20 minutes) and sleep efficiency (proportion of time actually spent asleep after going to bed, usually 85% of the total time). Sleep studies are an objective method to measure the quality of sleep. If done in the home setting, they are more representative than those done in hospital settings.2
Neural plasticity (dependent variable): This is defined as the capacity of the nervous system to modify itself functionally and structurally in response to experience and injury. This can be measured in real time using functional MRI (Magnetic resonance imaging).3
Biomarkers (dependent variable): Various biochemical markers associated with noise and music are dopamine, serotonin, endorphins, cortisol, oxytocin, leukocytes, cytokines, salivary immunoglobulin, interleukins, tumour necrosis factor, testosterone, brain-derived neurotrophic factor, free radicals, NADPH oxidase, nitric oxide synthase, the renin–angiotensin–aldosterone system, catecholamines, kinins, histamine, cholesterol, blood sugar, blood viscosity and coagulation parameters. The laboratories where they are estimated should calibrate their equipment to ensure accurate measurements. Many of these parameters show diurnal variation; this has to be factored in, and all measurements should be done at a pre-fixed time of day.
Blood pressure and heart rate (dependent variable): These variables have to be measured by automated and calibrated equipment to limit intra– and inter–observer variability.
Vertigo (dependent variable): Vertigo is a sense of rotation when the body is in reality static. It can be quantified using electronystagmography or videonystagmography. These instruments measure the degree of nystagmus (jerky eye movements) and characterize its type.
Auditory threshold shifts (dependent variable): Hearing threshold shifts are measured using audiometry in a sound-treated room. The threshold of hearing is measured at each frequency from 500 Hz to 8 kHz (Hz, hertz, is the measure of sound frequency). Audiometry is a subjective evaluation method, as the person responds to the sound heard through earphones. The auditory brainstem response, where sound is presented to the ear and the electrical response from the brainstem is recorded from electrodes placed on the scalp, is an objective measure of hearing thresholds.
Atherosclerosis (dependent variable): Thickening of the walls of blood vessels is termed atherosclerosis. Digital subtraction angiography, where a radio-opaque dye is injected into the blood vessels, is the most reliable measure of this thickening.
Histopathological changes in the inner ear (dependent variable): It is not possible to evaluate the histopathological (microscopic) changes in the inner ear of a living person exposed to noise. Usually, after consent, the inner ear is harvested after the person dies. There are repositories where such inner ear specimens are stored and examined to understand the pathological basis of inner ear damage in those exposed to loud noise.
Oxidative stress (dependent variable): This is defined as an imbalance between the production of free radicals and antioxidants in the body. Free radicals can damage tissues. Precise biochemical assays are used to measure the imbalance and detect an excess of free radicals.
Healthy life years (dependent variable): This is defined as the number of years a person is expected to live a healthy life. It is a health utility measure used to quantify the burden of disease.4
Productivity (dependent variable): This is measured by the output at the end of a particular time period for an assigned activity.
Verbal fluency (dependent variable): Verbal fluency is measured using language specific scales.
Fatigability (dependent variable): This measures a person's perception of fatigue. The person is asked to rate their fatigue in relation to the activity being measured.
Interpersonal relationship (dependent variable): This is a very complex area with multiple facets. There are well-established scales to measure each aspect of this domain. They are broadly classified as relationships with family members, friends, colleagues and romantic partners.
Classroom learning (dependent variable): This measures scholastic performance in the classroom. Marks obtained in the tests and examinations are the best method to evaluate classroom learning. Scales to measure specific aspects of learning can be designed for a particular context.
Reaction time (dependent variable): The time between stimulus and response is called reaction time. It can be measured best by video recording the sequence of events and measuring the time.
Variables in the affective domain are based on the perceptions of the person. There is no objective method to measure these domains. Psychometric scales are attempts to standardize these measurements. Methods to develop these scales are described in the following sections.
Pain (dependent variable): Pain is defined as an unpleasant sensory and emotional experience associated with actual or potential tissue damage. Visual analogue scales are the most appropriate method to estimate pain perception.
Concentration (dependent variable): It is defined as the ability to focus one's thoughts on the work being performed at a particular time. Measuring productivity is a surrogate measure of concentration.
Cognitive performance (dependent variable): Cognitive performance is based on various abilities like learning, thinking, reasoning, remembering, problem solving, decision making and attention. There are validated scales to measure each of these abilities and the researcher should collaborate with a clinical psychologist to measure cognitive performance.5
Motivation (dependent variable): Motivation is defined as the process that initiates, guides and maintains goal-oriented behaviours. Productivity is a surrogate measure of motivation.6
Endurance (dependent variable): The ability to withstand a difficult situation is defined as endurance. Measuring productivity in adverse situations is a measure of endurance.
Creativity and creative thinking (dependent variable): The ability to generate innovative ideas that solve problems is called creativity. Solving complex problems is an appropriate measure of creative thinking.
Emotional regulation (dependent variable): The ability of a person to appropriately manage emotions in challenging situations is defined as emotional regulation. It consists of attentional control, cognitive reappraisal and response modulation. Productivity in challenging situations is an indirect measure of emotional regulation.
Socio-cultural bonding (dependent variable): The ability to develop powerful and long-lasting relationships with peers and colleagues is socio-cultural bonding. It is a very complex domain and challenging to measure. Nevertheless, the number of friends and feedback from peers are effective measures of this aspect.
Relaxation and stress relief (dependent variable): The process of reducing stress is termed relaxation. A pre–post evaluation using a scale that measures stress is a good method to assess relaxation.
Mood (dependent variable): The state of mind is defined as mood. There are eight primary moods, namely anger, sadness, fear, joy, interest, surprise, disgust and shame, and within each of these domains a spectrum of emotions exists. Various scales are used to measure these emotions, but they are very subjective and standardization is challenging. Visual scales or numerical ranges are the best way to evaluate each emotion.
Tinnitus (dependent variable): Hearing sounds in the absence of any external sounds is defined as tinnitus. It is a very subjective feeling. In certain other conditions it may be an objective phenomenon. It can be measured using various scales. The researcher should collaborate with otolaryngologists (ENT specialists) and audiologists (Hearing evaluators) to quantify this variable.
Annoyance (dependent variable): Annoyance is a derivative of the primary emotion of disgust. Various scales exist to measure annoyance and the researcher should collaborate with a clinical psychologist to measure annoyance.
Intelligence (dependent variable): The ability to think effectively and solve problems efficiently is a measure of intelligence. There are many theories of intelligence that put forth varying facets of intelligence. Based on each theory, a scale to measure it exists. It is critical that the cultural context is taken into account to choose the items used to measure intelligence. A clinical psychologist is an important specialist required to measure intelligence.
After defining and operationalizing the independent and dependent variables, the associated variables that can create erroneous results have to be identified, and a strategy to limit these errors has to be selected. If we are evaluating the effect of noise on stress levels, then personality, age and prior exposure to loud noise are possible variables that can bias the results. Brainstorming by experts and reviewing the literature are methods to identify these variables. Various strategies to limit these errors are explained in the following sections.7, 8
Bias: Bias is understood as systematic variation of measurements from their true values that may be intentional or unintentional. A well-defined research protocol is the most appropriate method to limit bias.
Chance: Random variations without obvious relation to other measurements or variables constitute chance error. It is unintentional. Having a control group is the best method to limit chance error.
Natural history: The natural course of a disease may cause error while measuring the results. Here again a control group limits this error.
Regression to the mean: The tendency of extreme initial measurements to move closer to the average on repeat measurement, irrespective of the intervention, is called regression to the mean. Having a control group limits this error.
Placebo effect: In interventional studies, even a placebo can produce an effect due to the expectation that the treatment will work. This type of error can be limited by employing a control group.
Halo effect: In certain situations the attention and care of the healthcare provider is therapeutic. This is called halo effect. Here again a control group will assist in limiting the error.
Confounding: The results can be distorted by other unknown factors beyond our control. This phenomenon is known as confounding. Randomization or adjustment by multivariate analysis is the best method to limit this type of error.
Allocation (susceptibility) bias: At times the more favourable cases are allocated to the intervention group, which is called allocation bias. Here again, randomization and adjustment by multivariate analysis assist in limiting this error.
Ascertainment (detection) bias: During outcome assessment, knowledge of which group a participant belongs to may influence the measurements in favour of the treatment group. This type of bias is called ascertainment bias. Masking (blinding) during outcome analysis is employed to limit this type of error.
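Adjustment by multivariate analysis, mentioned above for confounding and allocation bias, requires statistical software in practice; the underlying idea can be illustrated with the simpler technique of stratification, using entirely hypothetical stress scores. Comparing exposed and unexposed workers within age strata removes the distortion that appears when the exposed group happens to be older:

```python
# Hypothetical data: (stress score, noise-exposed?, age stratum).
# Age confounds the crude comparison because the exposed workers are mostly older.
records = [
    (50, True,  "older"), (52, True,  "older"), (51, True,  "older"),
    (40, True,  "young"),
    (35, False, "young"), (37, False, "young"), (36, False, "young"),
    (45, False, "older"),
]

def mean(values):
    return sum(values) / len(values)

# Crude comparison, ignoring age: inflated by the age imbalance
crude = (mean([v for v, e, _ in records if e])
         - mean([v for v, e, _ in records if not e]))  # 10.0

def adjusted_difference(records):
    """Average the exposed-minus-unexposed difference within each stratum,
    weighting each stratum by its size (a simple stand-in for
    adjustment by multivariate analysis)."""
    strata = {s for _, _, s in records}
    diffs, weights = [], []
    for s in strata:
        exposed   = [v for v, e, st in records if st == s and e]
        unexposed = [v for v, e, st in records if st == s and not e]
        diffs.append(mean(exposed) - mean(unexposed))
        weights.append(len(exposed) + len(unexposed))
    return sum(d * w for d, w in zip(diffs, weights)) / sum(weights)

effect = adjusted_difference(records)  # 5.0: exposure effect with age held constant
```

In this toy data set, half of the crude 10-point difference is explained by age, and only 5 points remain attributable to noise exposure once age is held constant.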
In the previous section, we classified the variables under the physical domain and the affective domain. This section explains the method of developing a standardized measurement tool for variables in the affective domain.9,10 Figure 7.1 depicts the process of developing a standardized tool for measuring an affective domain. The first step is to identify the challenging situation in the area of noise – music – health where a measurement tool needs to be designed. To illustrate this critical step, we will take an example. It is a known fact that persons working in noisy environments do not comply with wearing ear protection devices. If we could devise a tool to measure the various factors that constitute the perception of comfort, then we could use it to predict the utilization of these devices. So the target population is persons working in noisy environments who are prescribed ear protection devices, and the action we intend to accomplish is adequate compliance with ear protection devices. The next step is to define the domains and facets of the tool. For this, we need to visit the site where the workers are employed. Ethnographic observation of the situation, along with in-depth interviews and focus group discussions, has to be performed to understand the various facets that constitute the perception of comfort while wearing ear protection devices. Ethnography is structured observation to understand the real-time situation. In-depth interviews and focus group discussions are one-to-one and group conversations to understand views, perceptions and experiences of wearing ear protection devices. Based on the responses, a set of affective constructs that constitute comfort perception is listed. Then questions or probes to elicit responses for each facet or domain are created. Each item (question) should be simple, single-barrelled, culturally appropriate and clearly address the domain. At least 100 items should be created for the first draft of the tool.
The responses can be in the form of increasing or decreasing degrees of agreement (strongly agree to strongly disagree). Visual scales in the form of numbered lines are another method to elicit responses. There should be at least 5 grades of response. Try avoiding the choice 'neither agree nor disagree', to discourage neutrality. Following this, stakeholders are requested to evaluate the items for content and face validity. Content validity ensures that the item elicits information relevant to all attributes of the domain being evaluated; face validity ensures that the item resembles the domain being evaluated. The stakeholders rate the items on a scale from relevant to irrelevant, and items that score less than 50% validity are discarded. The next step is to reduce the number of items; construct and criterion validity are the techniques used for this. Construct validity measures the ability of the items to distinguish two different groups. Criterion validity is based on the performance of the items when compared with a reference criterion. For example, a scale measuring comfort levels while wearing ear protection devices has construct validity if it clearly distinguishes those who perceive comfort from those in discomfort. For criterion validity, the tool's score is compared with the pressure exerted by the ear protection device on the ear canal. A statistical method called principal factor analysis is employed to identify the items that measure a similar domain and those that measure diverse domains. Statisticians will assist in performing these tests. The reader may consult advanced texts on tool development if actually planning to construct a standardized tool. The second version of the tool is then ready for pilot testing on persons in noisy environments who are prescribed ear protection devices. This first testing gives information for further reducing the number of items and for assessing intra– and inter–observer reliability.
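Factor analysis itself requires statistical software, but one elementary item-analysis step often used alongside it can be sketched by hand: correlating each item's score with the total of the remaining items and flagging weakly correlated items as candidates for removal. The responses below are hypothetical five-point agreement ratings for three items:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical responses: rows are respondents, columns are three items (1-5)
responses = [
    [5, 4, 2],
    [4, 5, 3],
    [2, 1, 3],
    [1, 2, 2],
    [4, 4, 3],
]

totals = [sum(row) for row in responses]
item_total_r = []
for item in range(3):
    scores = [row[item] for row in responses]
    # corrected item-total correlation: compare each item against the
    # total of the OTHER items, so the item does not correlate with itself
    rest = [t - s for t, s in zip(totals, scores)]
    item_total_r.append(pearson(scores, rest))
# items with a low correlation (e.g. r < 0.3) are candidates for removal
```

In this toy data set the third item correlates poorly with the rest of the scale and would be flagged for review, which is the same judgement principal factor analysis makes on a larger scale.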
The next level of testing is done in a large population, and the final draft of the tool is created. User manuals are written. If the inventors intend to translate the tool into a language other than English, an elaborate process of translation and back-translation that preserves the psychometric properties of the tool has to be followed. The components of ensuring equivalence are conceptual, semantic, technical, cultural and measurement equivalence.
Various ways of administering the tool are face-to-face interview, telephonic interview, mailed questionnaire and computer-assisted methods. The Consensus-based Standards for the selection of health Measurement INstruments (COSMIN) is an internationally accepted standard to assess the quality of a measurement tool.11 After the tool is made available for use by researchers, it is further refined by the inventors at regular intervals.
Noise and music are complex signals, and quantifying and qualifying them is a challenging task. Measuring the effect of noise and music on health is surrounded by many confounding factors. Extracting the true effect on health is possible only to the extent that we are able to control or adjust the bias using the strategies described in Section 7.3. Linear analysis has its limitations in these situations. With the advent of data mining and machine learning, extracting signals in the presence of noise (here noise denotes confounding factors, not noise as understood elsewhere in this book) has become more accurate. Artificial intelligence is an amalgamation of these principles, and artificial neural networks (ANNs) can assist in these complex tasks. The fundamental premise on which neural networks are based is that humans solve problems by pattern recognition, which occurs through parallel and distributed processing in the neural networks of the human brain.12 The same principle is used to create an ANN: the network is trained to recognize patterns and arrive at solutions for a given problem. The characteristics of human intelligence are robustness and fault tolerance, flexibility, the ability to recognize patterns that are fuzzy, probabilistic, noisy or inconsistent, and parallel processing. While developing artificial neural networks, these functional aspects need to be recreated. Further explanation of artificial intelligence is beyond the purview of this chapter; the reader is advised to refer to the advanced texts mentioned in the reference section.
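The idea of training a network to recognize a pattern can be shown at its smallest scale with a single artificial neuron (a perceptron) learning the logical-OR pattern. This is only a toy sketch of the learning principle, not a practical ANN:

```python
def step(x):
    """Threshold activation: the neuron 'fires' when its weighted input is positive."""
    return 1 if x > 0 else 0

# Training patterns for logical OR: (bias input of 1, two inputs) -> target output
patterns = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]

weights = [0.0, 0.0, 0.0]  # one bias weight plus one weight per input
for _ in range(10):        # repeated presentation of the training patterns
    for inputs, target in patterns:
        output = step(sum(w * x for w, x in zip(weights, inputs)))
        error = target - output
        # perceptron learning rule: nudge each weight in the direction
        # that reduces the error on this pattern
        weights = [w + error * x for w, x in zip(weights, inputs)]

predictions = [step(sum(w * x for w, x in zip(weights, inputs)))
               for inputs, _ in patterns]  # learns the OR pattern: [0, 1, 1, 1]
```

Practical networks stack many such neurons in layers and use gradient-based training, but the loop above already shows the essential cycle of presenting patterns, measuring error and adjusting weights.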
Quantifying the effects of noise and music on health is a challenging task. The complexity of quantifying and qualifying noise and music is compounded by the multiple confounders surrounding the measurement of health effects. Against this background, the researcher should apply the following guiding principles to quantify the effects of noise and music on health:
1. Define and operationalize the dependent and independent variables
2. Identify the factors that can bias the results
3. State the strategy to limit various types of bias
4. Perform linear and parallel processing methods (employing artificial intelligence) for data analysis
5. State unadjusted (without limiting bias) and adjusted (after limiting bias) results for the readers to make their clinical decisions