Transcribe Japanese Audio to Text

Japanese (日本語) speech to properly written text: kanji, kana, and punctuation handled. Free, no account needed.

No sign-up · No watermark · TXT, SRT, and VTT exports · Files auto-delete in 24h

Language pre-set for this page: Japanese (日本語)

Bigger files or more uploads? Free account: 5 files/day, 1-hour files · Pro: 10-hour files + speaker labels

Kanji + kana output · 標準語 · Kansai-ben · Business Japanese

How Japanese comes out: script and segmentation

The transcript is written the way Japanese is actually written: a natural mix of kanji, hiragana, and katakana, chosen by context — 会議 not かいぎ, katakana for loanwords and foreign names, kana for grammar. Japanese punctuation (、and 。) is inserted, and since Japanese doesn't use spaces, the text flows unsegmented, exactly as a native writer would produce it.

One consequence for subtitles: cue lengths are based on the model's phrase segmentation rather than word counts. The SRT/VTT exports produce short, readable cues, but Japanese subtitle conventions (13–16 characters per line) may mean you split a few cues in the editor for broadcast-style captioning.
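
To find the cues that might need splitting, an exported SRT can be scanned for lines over a chosen character limit. The sketch below is an illustration, not part of the product: the `long_cues` helper and the 16-character default are assumptions (the upper end of the 13–16 range mentioned above), and it expects a plain SRT layout of index, timing line, then text lines.

```python
import re

MAX_CHARS = 16  # assumed limit; pick whatever your captioning style guide requires


def long_cues(srt_text: str, max_chars: int = MAX_CHARS) -> list[tuple[int, str]]:
    """Return (cue_number, line) pairs for subtitle lines longer than max_chars."""
    flagged = []
    # SRT cues are separated by blank lines: index, timing line, then text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete cue
        try:
            number = int(lines[0].strip())
        except ValueError:
            continue  # skip malformed blocks
        for text in lines[2:]:
            text = text.strip()
            # len() counts code points, which matches characters for ordinary Japanese text
            if len(text) > max_chars:
                flagged.append((number, text))
    return flagged


sample = """1
00:00:01,000 --> 00:00:03,000
おはようございます。

2
00:00:03,000 --> 00:00:07,000
本日の会議では来期の予算について詳しく話し合います。
"""

for number, text in long_cues(sample):
    print(number, len(text), text)
```

Only the second cue is reported here; the first fits within the limit. The flagged cues are the ones to split by hand in the editor.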

Registers and dialects

Standard Japanese (標準語), as heard in meetings, lectures, podcasts, and keigo-heavy business speech, is the model's strong suit and transcribes with high accuracy. Kansai-ben and other regional forms are usually rendered correctly or lightly normalized; strong regional dialect from older rural speakers (Tōhoku, Kagoshima) is the genuinely hard case. Because Japanese is rich in homophones, the model occasionally picks the wrong kanji for a name, so names are worth a scan in the editor.

Frequently asked questions

Does the transcript use kanji or kana?

Both, mixed naturally by context — standard written Japanese, not a romaji or all-kana rendering. Common words appear in kanji, grammatical elements in hiragana, loanwords and foreign names in katakana. Rare or ambiguous readings sometimes get the wrong kanji; those are quick edits.
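
For a quick sanity check that a transcript really is mixed-script rather than all-kana, characters can be counted by Unicode block. This is a rough sketch under stated assumptions: `script_of` and `script_mix` are hypothetical helper names, and only the most common ranges are covered (rare kanji extensions and the iteration mark 々 fall under "other").

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by Unicode block (common ranges only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"  # this block also contains the long-vowel mark ー
    if 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF:
        return "kanji"  # CJK Unified Ideographs plus Extension A
    return "other"  # punctuation, digits, Latin letters, and so on


def script_mix(text: str) -> Counter:
    """Count characters per script, ignoring whitespace."""
    return Counter(script_of(ch) for ch in text if not ch.isspace())


print(script_mix("会議でプレゼンをします。"))
```

For the sample sentence this counts 2 kanji, 5 hiragana, 4 katakana, and 1 other (the closing 。). A transcript that shows zero kanji over a long passage would be worth a closer look.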

How is segmentation handled without spaces?

Japanese text is written without spaces and the transcript follows that convention. Segments are split on natural phrase boundaries with 、and 。 punctuation inserted by the model, so the text reads like written Japanese, and SRT cues break at phrase edges rather than mid-word.
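
Because the punctuation marks the phrase edges, an exported TXT can be re-split at 、 and 。 when different cue boundaries are needed. The following is a minimal sketch of that idea only; it is not how the model itself segments audio, and `split_phrases` is a hypothetical helper.

```python
import re


def split_phrases(text: str) -> list[str]:
    """Split unspaced Japanese text after each 、 or 。, keeping the punctuation."""
    # The lookbehind makes the split happen *after* the mark, so it stays with its phrase.
    parts = re.split(r"(?<=[、。])", text)
    return [p for p in parts if p]  # drop the empty string left after the final 。


print(split_phrases("今日は会議があります。資料は、後で送ります。"))
```

The sample yields three phrases: 今日は会議があります。, 資料は、, and 後で送ります。 Splitting on a zero-width pattern like this requires Python 3.7 or later.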

How accurate is Japanese compared to English?

High, but a notch below the top European languages — Japanese's pitch accent and massive homophone inventory make kanji selection the main error source rather than mishearing. Clear studio or meeting audio produces very usable transcripts; the editor pass is mostly homophone/kanji checks.
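
When the same names recur across many recordings, the homophone pass can be partly scripted with a small correction glossary applied to the exported TXT. This is a hedged sketch: `apply_glossary` and the glossary entries are made up for illustration, and blind replacement can also hit unrelated words that contain the same characters, so the result still deserves a read-through.

```python
def apply_glossary(text: str, glossary: dict[str, str]) -> tuple[str, dict[str, int]]:
    """Replace known wrong spellings and report how many times each one was found."""
    counts: dict[str, int] = {}
    for wrong, right in glossary.items():
        found = text.count(wrong)
        if found:
            text = text.replace(wrong, right)
            counts[wrong] = found
    return text, counts


# Hypothetical glossary: transcribed spelling -> spelling the speaker actually uses.
glossary = {"斉藤": "斎藤", "河合": "川合"}

fixed, counts = apply_glossary("斉藤さんと河合さんが出席しました。", glossary)
print(fixed)
print(counts)
```

The returned counts make it easy to see which names were corrected and how often, which helps when deciding whether an entry is too aggressive.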

Does it handle Kansai-ben and other dialects?

Kansai-ben is common enough in media that the model handles it well, usually writing dialect forms as spoken (あかん, ~へん negatives). Heavier regional dialects — rural Tōhoku, Okinawan-influenced speech — degrade accuracy noticeably. Standard-register speakers with regional accents are no problem.

Can I get English text from Japanese audio?

This page produces a Japanese transcript, which is the more accurate path. For English, export the TXT and machine-translate it: Japanese→English translation is considerably more accurate from clean text than from direct speech translation. In-app translation is on the roadmap.

Will it handle keigo and business Japanese?

Yes — polite and honorific registers are heavily represented in training data (news, presentations, service interactions) and transcribe accurately, including humble/honorific verb forms. Casual speech with contractions (っす, じゃん) also comes out as spoken.