SBIR-STTR Award

VocaliD SBIR Phase II: Optimized Speech Corpora for Personalized Speech Synthesis
Award last edited on: 3/13/2020

Sponsored Program
SBIR
Awarding Agency
NIH : NIDCD
Total Award Amount
$2,026,253
Award Phase
2
Solicitation Topic Code
-----

Principal Investigator
Rupal Patel

Company Information

VocaliD Inc

50 Leonard Street
Belmont, MA 02478
   (339) 368-0416
   hello@vocalid.co
   www.vocalid.ai
Location: Single
Congr. District: 05
County: Middlesex

Phase I

Contract Number: 1R43DC014607-01
Start Date: 6/1/2015    Completed: 11/30/2015
Phase I year
2015
Phase I Amount
$215,288
The human voice is a complex signal that conveys multiple aspects of one's identity including age, gender, ethnicity, size, and personality, among others. Yet, to date, users of augmentative and alternative communication (AAC) devices, screen reading technologies, and other text-to-speech applications have relied on a limited set of generic voices. VocaliD Inc. aims to create custom crafted synthetic voices that reflect the end-user by combining the recipient's residual vocal abilities with an anatomically similar donor's speech database. The resultant voice sounds like the recipient in age, personality and vocal identity but is as clear and understandable as the donor. We have successfully integrated our research prototype into several AAC devices and have three beta users currently using the technology. Family members and users attribute increased AAC device usage in educational and social settings as well as improved self-esteem and quality of life to this innovation. Our process to date, however, has relied on an onerous procedure to collect a sufficiently comprehensive corpus of donor speech and an ad hoc process to elicit speaker-identifying cues from recipients. This Phase I SBIR project aims to make VocaliD's personalized voice technology a viable option for millions of users by standardizing and optimizing the donor and recipient voice collection processes. These advances are critical to transforming this work from lab prototype into a commercial venture. Our innovation is grounded in the source-filter theory of speech production, which divides speech into a source component (the vocal folds) and a filter component (the rest of the vocal tract) that are largely independent. Empirical evidence suggests that despite impaired filter modulation, individuals with speech impairment have residual control over source characteristics. 
Since source and filter characteristics both contribute to speaker identity, our key challenge is to extract as much identity information as possible from a limited amount of recipient vocalizations and combine this with the speech clarity information from donor voices so as to create an authentic yet understandable transformed voice. Thus, this Phase I has two specific aims: 1) to determine the optimal number and composition of stimuli recorded by donors that will result in a sufficiently intelligible and natural-sounding concatenative synthesis voice, and 2) to determine a set of speaker identity cues that can be extracted from sparse vocalization samples produced by voice recipients. Our ultimate goal is to produce resultant voices that are acoustically and perceptually identified as belonging to the recipients. In the United States alone, there are over 2.5 million AAC users who need to be heard in their own voices; an additional 3-5 million individuals with visual impairment who could benefit from a personalized screen reader, especially when composing written text; and several hundred million devices and applications in the 'internet of things' that enable us to access information, communicate and interact via speech. VocaliD has the potential to give the gift of voice to all those who need and want it to enhance how they learn, work and play.
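
The source-filter separation described in the abstract above can be illustrated with a toy synthesizer. This is a minimal sketch, not VocaliD's method: the function names, the impulse-train source, the two-pole resonator filter, and the formant values are all assumptions chosen for illustration and do not come from the award record.

```python
import math

def glottal_source(f0_hz, sample_rate, duration_s):
    """Impulse train at the fundamental frequency.

    A crude, assumed stand-in for the vocal-fold (source) component.
    """
    n = int(sample_rate * duration_s)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: an assumed stand-in for one vocal-tract formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1 so the filter is stable
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(f0_hz, formants, sample_rate=16000, duration_s=0.05):
    """Pass the source through a cascade of formant resonators (the filter)."""
    signal = glottal_source(f0_hz, sample_rate, duration_s)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative formant values only (hypothetical, not from the award).
FILTER = [(700.0, 130.0), (1200.0, 70.0)]

# Same filter, two different sources: the pitch changes, the resonances do not.
low_voice = synthesize(100.0, FILTER)
high_voice = synthesize(200.0, FILTER)
```

Because the source and the filter are separate arguments, either one can be swapped without touching the other, which is the property the abstract relies on when it pairs a recipient's source characteristics with a donor's filter characteristics.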

Public Health Relevance Statement:
VocaliD Inc. aims to create custom crafted synthetic voices that reflect the end-user by combining the recipient's residual vocal abilities with an anatomically similar donor's speech database. The resultant voice sounds like the recipient in age, personality and vocal identity but is as clear and understandable as the donor. This Phase I SBIR project aims to make VocaliD's personalized voice technology a viable option for millions of users by standardizing and optimizing the donor and recipient voice collection processes.

Project Terms:
Access to Information; Age; alternative communication; Augmentative and Alternative Communication device; base; Characteristics; Classification; Collection; commercialization; Communication; Communication Aids for Disabled; Complex; Cues; Custom; Databases; design; Development; Devices; empowered; Equipment and supply inventories; Ethnicity aspects; Family member; Frequencies (time pattern); Gender; Generic Drugs; Gift Giving; Goals; Hearing; Human; Impairment; improved; Individual; innovation; insight; Intellectual Property; Internet; Joints; Language; Learning; Linguistics; Mainstreaming (Education); Mining; novel; Personality; Phase; Phonetics; Play; Procedures; Process; Production; Protocols documentation; prototype; public health relevance; Quality of life; Reader; Reading; Relative (related person); Research; research study; Residual state; Rest; Sampling; self esteem; Self-Help Devices; Series; Signal Transduction; Small Business Innovation Research Grant; social; sound; Source; Speech; Stimulus; Target Populations; Technology; Text; theories; United States; Visual impairment; vocal cord; vocalization; Voice; Work; Writing

Phase II

Contract Number: 2R44DC014607-02
Start Date: 00/00/00    Completed: 00/00/00
Phase II year
2017
(last award dollars: 2019)
Phase II Amount
$1,810,965

Our voices are not identical, they are our identities. The human voice is a powerful signal that conveys one's age, gender, size, ethnicity, and personality, among other attributes. Yet, until now, users of augmentative and alternative communication (AAC) devices, screen reading technologies and other text-to-speech (TTS) applications have relied on a limited set of mass-produced, generic-sounding synthetic voices. This mismatch in vocal identity impacts educational outcomes, infringes on personal safety, and hinders social integration. Conventional methods for building a synthetic voice require a voice actor to record an extensive dataset of studio-quality recordings, which are used to train a computational model and generate the output voice. The process is time- and labor-intensive and thus inaccessible to everyday consumers, let alone those with speech impairment. VocaliD Inc.'s award-winning technology offers an unprecedented means to build custom-crafted synthetic voices that reflect the recipient by combining his/her own residual vocalizations with recordings of a matched speaker from our Human Voicebank. We have discovered that even a single vowel contains enough "vocal DNA" to seed the personalization process. VocaliD's custom voice sounds like the recipient in age, personality and vocal identity but is as clear and understandable as the donor's recordings. To create an affordable and efficient method of voice personalization, we leverage the penetration of high-quality microphones and recording software on consumer-grade computers and increased technological literacy to crowdsource the collection of speech and voice recordings. This enables engagement across broad age, socioeconomic, cultural and linguistic groups in order to truly sample the diversity of the human voice. The challenges, however, are to ensure high-quality recordings and to sufficiently engage speech donors to complete the recording corpus. 
This Phase II project builds upon our success in Phase I to reduce the length of the donor corpus and to streamline and automate the recipient protocol. Results of our perceptual experiments indicated that while we were able to reduce the length of the donor corpus by 70%, it came at the cost of reduced intelligibility and naturalness. Since voice quality is vital to acceptance and adoption of our voices, this Phase II proposal is aimed at improving the clarity and expressiveness of our voices while maintaining the optimized corpus length. We propose to improve TTS intelligibility by developing methods to mitigate the effects of background noise and reverberation during donor and recipient recordings and aligning expected and actual spoken transcripts to reduce errors in TTS model building (Aim 1). To address the issue of TTS naturalness, we propose to modify the donor corpus to include more prosodically diverse contrasts and adapt the donor protocol to elicit natural melodic intonation and phrasing (Aim 2). These advances will yield a scalable and cost-effective method of personalized voice creation that will humanize speech-enabled technologies for AAC and beyond.
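
Aim 1 mentions aligning expected and actual spoken transcripts to reduce errors in TTS model building. The award record does not describe how that alignment is done; the following is a minimal sketch that assumes a simple word-level comparison using Python's standard `difflib` module. The function name `align_transcripts` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher

def align_transcripts(expected, actual):
    """Return word-level mismatches between a prompt and what was spoken.

    Each mismatch is (tag, expected_words, actual_words), where tag is
    'replace', 'delete', or 'insert' as reported by difflib.
    Assumption: 'actual' is a text transcript of the recording, obtained
    by some means not covered here (for example, a speech recognizer).
    """
    exp_words = expected.lower().split()
    act_words = actual.lower().split()
    matcher = SequenceMatcher(a=exp_words, b=act_words, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((tag, exp_words[i1:i2], act_words[j1:j2]))
    return mismatches

# Hypothetical prompt and recording transcript.
skipped = align_transcripts("the quick brown fox", "the quick fox")
misread = align_transcripts("say hello now", "say yellow now")
clean = align_transcripts("the quick brown fox", "the quick brown fox")
```

A recording whose mismatch list is non-empty could be flagged for re-recording or excluded from model building, so that the audio and its transcript agree before training.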

Public Health Relevance Statement:
VocaliD's breakthrough technology powers the first-ever custom synthetic voices that are made using only a brief sample of the recipient's residual voice combined with recordings of a matched speaker from a crowdsourced voicebank. This Phase II SBIR proposal addresses the challenge of creating a scalable and affordable method for achieving high-quality, natural-sounding, personalized voices from sparse and 'non-laboratory grade' recipient and speech donor samples.

Project Terms:
Acoustics; Address; Adoption; Age; Augmentative and Alternative Communication; Award; base; Collection; Communication; communication device; Complex; Computer Simulation; Computer software; Computers; cost; cost effective; crowdsourcing; Custom; Data; Data Collection; Data Set; design; Development; DNA; Dreams; Ensure; Environment; Ethnic Origin; experimental study; Family member; Gender; Generations; Generic Drugs; girls; Human; Impairment; improved; Individual; Length; Limb Prosthesis; Linguistics; literacy; man; Manuals; Mediating; Methods; model building; Modeling; Noise; Outcome; Output; Penetration; Personality; Phase; phrases; Process; Prosthesis; Protocols documentation; Reading; Residual state; Safety; Sampling; Seeds; Series; Signal Transduction; Small Business Innovation Research Grant; social integration; socioeconomics; sound; Speech; Speech Intelligibility; speech processing; success; Technology; Text; Time; Training; Transcript; Variant; vocalization; Voice; Voice Quality