this post was submitted on 17 Jul 2025

From their newsletter:

We’re so excited to share that the 22nd dataset release for Common Voice is now available for download.

Common Voice 22.0 has an additional 281 hours of speech data, bringing the total number of hours to 33,815. This release has also seen a jump of 296 newly validated hours, with a total of 22,640 validated hours of clips. This release welcomes the addition of Aromanian (rup), Tajik (tg), and Venda/Tshivenda (ve) languages.

Aromanian is spoken by around 210,000 people in the Balkans, while Tajik is a language closely related to Persian spoken in Tajikistan and Uzbekistan by over 10 million people. Venda / Tshivenda is spoken by over 2 million people as a first or other language in South Africa and Zimbabwe.

This brings the total number of languages available in this Scripted Speech release to 137.

For those unfamiliar:

Common Voice is a crowdsourcing project started by Mozilla to create a free and open speech corpus. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0.[1] This license ensures that developers can use the database for voice-to-text and text-to-voice applications without restrictions or costs.

[–] Kissaki@programming.dev 12 points 1 day ago (2 children)
  • 44% Male/Masculine
  • 39% No information
  • 18% Female/Feminine

Tech bias even in public-domain, open-contribution datasets. Apparently it could use more female contributors.

[–] FundMECFS@quokk.au 5 points 1 day ago

Yeah, that’s pretty bad. I wonder what other biases there are as well: class, dialect, etc.