2025.8.25

I Built "interview2jppodcast," a Tool That Generates Podcast-Style Audio

Technology

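Hello! This is Ikuo Suzuki from the AI Service Development Office. I built a tool called "interview2jppodcast" that generates podcast-style audio from transcripts of overseas interviews. If you like, please start by listening to the promotional podcast for it.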

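In fact, the audio for this podcast was itself created with interview2jppodcast. Just by feeding in a transcript like the one below, you can generate audio of a natural back-and-forth between two people. Its distinguishing features are that it separates the speakers without being told to, and that it automatically mixes in the BGM you specify.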

This is "Tech & Life Cafe," and I'm your host, Takuya.
And I'm your co-host, Rena! Hey Takuya, I recently found an interesting interview video from overseas, but the English was so fast I could barely catch half of it...
Oh, I totally get that. Especially with technical topics, it's hard to even keep up with the subtitles. But I found an amazing app that solves that exact problem. It's called "interview2jppodcast"!
"Interview to Japanese Podcast"? So, just like the name says, it turns interviews into Japanese podcasts?
Exactly! With this app, as long as you have the transcript from an overseas interview, you can "listen" to it just like a Japanese radio show.
Wow, how does that work?
First, the amazing thing about this app is that it doesn't just translate the foreign transcript. It actually separates the voices for each speaker and turns it into a natural-sounding Japanese conversation.
Wow! It even does speaker separation? That's great, because with a mixed block of text, you'd lose track of who's talking.
That's right. And the one reading the Japanese translation is Google's latest AI, "Gemini." So it doesn't sound robotic at all; it's natural, like you're listening to real people talk.
Gemini! You can definitely expect high quality then. But for a long interview, like an hour, doesn't it take a long time to create the audio file?
Good question. That's another incredible feature of this app. It splits the long transcript into smaller "chunks" and processes them all at once in parallel with multiple AIs. It uses a technology called GraphAI, and thanks to that, even long interviews can be converted to audio in no time.
I see! So it's fast because it splits the work and does it all at once. That's smart!
And that's not all. It even has a feature to mix in your favorite BGM with the finished audio. So you can play your favorite music in the background and enjoy it just like a real podcast.
Amazing! So that means I can listen to that video I couldn't understand as a Japanese podcast during my commute!
Exactly! With "interview2jppodcast," you can absorb interesting information from around the world without worrying about language barriers. It makes the latest global information much more accessible.
That's groundbreaking! I'm going to try it right away!
On this episode of "Tech & Life Cafe," we introduced "interview2jppodcast." If you're interested, please be sure to check it out. See you next time!
See you then!

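I have written detailed usage instructions in the README of the GitHub repository linked below, so please give the tool a try if you are interested. I apologize that the README is in English; if English is not your strong suit, please translate it as needed. The tool also runs on Node.js, so it may be hard to set up unless you are an engineer. Thank you for your understanding.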

ikuo5710/interview2jppodcast: create Japanese Podcast from long inteview text

How It Came About

What prompted me to build this tool was my sense that the United States has a great many fascinating podcasts. Lex Fridman and Matthew Berman, both of whom I have covered on this blog, are examples, and Guy Kawasaki's Remarkable People is another good one.

There was also the X post below, so, if I may say so myself, I think I have my eye on a pretty good area.

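Podcasts are rapidly emerging as a new medium in the United States. The capital and infrastructure behind them are on a completely different scale from "video streaming" in Japan. Annual revenue above 1 million dollars is commonplace. Streamers with the influence of an entire Japanese BS (broadcast satellite) channel stand out. Anchors who currently appear on MSNBC and similar outlets also stream long-form news commentary.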

Source: X post by Morley Robertson ( https://x.com/gjmorley/status/1959099426833678584 )

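Because fascinating content keeps being produced in the United States like this, I built a tool that converts it into Japanese podcast-style audio for my own sake, since I can hardly follow spoken English. It goes without saying, given that I built it for myself, but thanks to this tool I can now listen to shows like the ones mentioned above while driving or walking to work, and I am very satisfied with it.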

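A note on usage: please limit it strictly to personal use, because distributing the generated audio without permission is highly likely to constitute copyright infringement. I carelessly uploaded some in the past, but I have since deleted all of it.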

A Brief Introduction to the Technologies Used

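In building interview2jppodcast, I wrote the program almost exactly as described in my earlier entry, "I Tried Making a Japanese Version of the Lex Fridman Podcast." TypeScript was a new language for me, but thanks to AI coding agents (Gemini CLI and Codex CLI) I managed to see it through.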

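1. Get the transcript from YouTube: manual (because no API is provided)
2. Using the transcript as input, create a text file with the speakers separated and the content translated into Japanese: gemini-2.5-pro (ChatGPT is a poor fit because its context window is small)
3. Split the text file into chunks small enough to stay under the TTS length limit, and convert them to audio with asynchronous parallel processing: GraphAI (flow control), gemini-2.5-flash-preview-tts (TTS)
4. Once all chunks have been converted to audio, concatenate them in the original order: GraphAI (flow control), FFmpeg (audio concatenation)
5. Add BGM: SUNO (BGM creation), FFmpeg (audio mixing)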

Source: this blog's entry "I Tried Making a Japanese Version of the Lex Fridman Podcast"
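The chunk-splitting part of step 3 can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the function name `splitIntoChunks` and the `maxChars` parameter are hypothetical, and it assumes the translated text has one speaker turn per line and that the TTS limit can be approximated by a character count.

```typescript
// Hypothetical helper (not from the repository): groups whole lines into
// chunks of at most maxChars characters so that no speaker turn is cut in half.
// A single line longer than maxChars is kept as its own oversized chunk.
function splitIntoChunks(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const candidate = current === "" ? line : current + "\n" + line;
    if (candidate.length > maxChars && current !== "") {
      // Adding this line would exceed the limit: close the current chunk.
      chunks.push(current);
      current = line;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Splitting on line boundaries keeps each speaker turn intact, which matters when the TTS model is assigning a voice per speaker.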

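I combined several technologies, but I was especially impressed by how convenient GraphAI is (a library that lets you build AI applications by describing the data flow in its own declarative language). With GraphAI, I was able to implement, in concise code, both splitting a long transcript into chunks and converting them to audio in parallel, and retrying when a call to the TTS API (gemini-2.5-flash-preview-tts) fails.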
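To make the idea concrete, here is what "convert chunks in parallel, retry on failure, keep the original order" looks like when written by hand in plain TypeScript. It deliberately does not use GraphAI's API, and it is not the tool's actual code; `withRetry`, `convertAllChunks`, and the `convert` callback (standing in for the TTS call) are hypothetical names.

```typescript
// Retries an async task up to maxAttempts times, waiting delayMs between attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Starts the conversion of every chunk at once. Promise.all returns results
// in input order, so the audio parts can later be concatenated as-is.
// `convert` is a placeholder for the real TTS call (an assumption here).
async function convertAllChunks<T>(
  chunks: string[],
  convert: (chunk: string, index: number) => Promise<T>,
  maxAttempts = 3,
  delayMs = 1000,
): Promise<T[]> {
  return Promise.all(
    chunks.map((chunk, index) =>
      withRetry(() => convert(chunk, index), maxAttempts, delayMs),
    ),
  );
}
```

A declarative flow library takes over this kind of plumbing, which is presumably where the conciseness mentioned above comes from.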

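On the other hand, working around Gemini's rate limits was fairly hard. I am a Tier 1 user, so I had to keep each chunk small enough not to exceed the tokens-per-minute limit (10,000 tokens) while also keeping the total number of chunks low enough to fit within the requests-per-day limit (100 requests). Finding the right balance between these conflicting parameters was difficult (or rather, tedious).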
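The tension between the two limits can be written down as simple arithmetic. The sketch below is only an illustration under stated assumptions: one API request per chunk, a known total token count for the text, and a fixed number of chunks sent within any one minute. The names `RateLimits`, `feasibleChunkTokens`, and `chunksPerMinute` are hypothetical; the figures 10,000 and 100 are the limits quoted above.

```typescript
interface RateLimits {
  tokensPerMinute: number; // e.g. 10,000, as quoted in the text
  requestsPerDay: number;  // e.g. 100, as quoted in the text
}

// Returns the range of tokens-per-chunk that satisfies both limits,
// or null when no chunk size can satisfy them at the same time.
// Assumes one request per chunk and ignores requests spent on retries.
function feasibleChunkTokens(
  totalTokens: number,
  limits: RateLimits,
  chunksPerMinute: number,
): { min: number; max: number } | null {
  // Daily request limit: chunks must be large enough that there are few of them.
  const min = Math.ceil(totalTokens / limits.requestsPerDay);
  // Per-minute token limit: chunks sent in the same minute share the budget.
  const max = Math.floor(limits.tokensPerMinute / chunksPerMinute);
  return min <= max ? { min, max } : null;
}

// Example with made-up numbers: a 200,000-token text sent 4 chunks per minute.
const range = feasibleChunkTokens(
  200_000,
  { tokensPerMinute: 10_000, requestsPerDay: 100 },
  4,
);
console.log(range); // { min: 2000, max: 2500 }
```

Retries also consume daily requests, so in practice it would be safer to aim for the upper part of the range and leave some headroom.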

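Having built programs with AI coding agents several times now, I feel I am gradually getting a sense of what matters when having AI write code for me (although it is still hard to put into words). I suspect there are uses for this at work as well, so I will keep experimenting.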

Related Entries

I Tried Making a Japanese Version of the Lex Fridman Podcast

Multi-Speaker Speech Synthesis with gemini-2.5-flash-preview-tts