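Trying out Trafilatura, a tool that converts web pages to text

Previously, to pass up-to-date information from web pages to LLMs, I used a tool called firecrawl to convert web pages to Markdown. Its main-content extraction seemed weak, though, so I'm trying a tool called Trafilatura instead.

The official description of the tool reads as follows.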

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to commonly used formats.

Installation

If you manage the project with uv, use the commands below; a plain pip install trafilatura works just as well. This follows the official documentation.

uv init
uv add trafilatura
uv run trafilatura --markdown -u "https://..."

For the options available when using it as a command-line tool, see the documentation (https://trafilatura.readthedocs.io/en/latest/usage-cli.html).

I tried it on several pages, and it extracted the main content very accurately.

Sample output

Gigazine

First, a site like Gigazine, which has many ads and also contains tables.

https://gigazine.net/news/20260225-moonshine-voice/

Note: if you want images, adding the --images option also retrieves the image URLs.

uv run trafilatura --markdown -u "https://gigazine.net/news/20260225-moonshine-voice/"

# 無料で日本語もサポートしリアルタイム音声アプリをWhisperより高精度で開発できるオープンソースAIツールキット「Moonshine Voice」


リアルタイムで音声を扱うアプリケーションを作成できるオープンソースのAIツールキットが「**Moonshine Voice**」です。


**GitHub - moonshine-ai/moonshine: Fast and accurate automatic speech recognition (ASR) for edge devices**

**https://github.com/moonshine-ai/moonshine**


「Moonshine Voice」はすべてがデバイス上で実行されるため、高速かつプライベートであり、アカウントやクレジットカード、APIキーなどは必要ありません。



また、フレームワークとモデルが生配信アプリ向けに最適化されているので、ユーザーが話している間に多くの処理を行い、低遅延で応答します。


すべてのモデルが独自の最先端研究に基づきゼロからトレーニングされていて、精度はOpenAIの音声認識モデル「Whisper Large V3」よりも高いとのこと。


ライブスピーチ処理時のベンチマーク結果を単語誤り率(WER)の低い順に並べたものが以下。「Moonshine Medium Streaming」が「Whisper Large V3」を上回ったほか、「Moonshine Small Streaming」は「Whisper Small」を、「Moonshine Tiny Streaming」は「Whisper Tiny」を、それぞれ上回っています。


| モデル名 | WER | パラメーター数 | 処理速度(MacBook Pro) | 処理速度(Linux x86) | 処理速度(Raspberry Pi 5) |
|---|---|---|---|---|---|
| Moonshine Medium Streaming | 6.65% | 245 million | 107ms | 269ms | 802ms |
| Whisper Large v3 | 7.44% | 1.5 billion | 11,286ms | 16,919ms | N/A |
| Moonshine Small Streaming | 7.84% | 123 million | 73ms | 165ms | 527ms |
| Whisper Small | 8.59% | 244 million | 1940ms | 3,425ms | 10,397ms |
| Moonshine Tiny Streaming | 12.00% | 34 million | 34ms | 69ms | 237ms |
| Whisper Tiny | 12.81% | 39 million | 277ms | 1,141ms | 5,863ms |


Whisperは音声合成技術を大きく前進させた一歩で、最大のモデルであるLarge V3はGoogleやAppleといった大企業以外でも利用可能で高い精度を出すことができました。このため、Moonshineも「faster-whisper」などの大ファンだそうですが、ライブ音声インターフェースを必要とするアプリケーションを構築する中で、Whisperでは利用できない機能が必要なことに気付いたとのこと。


1点目は「Whisperは常に30秒の入力ウィンドウで動作する」という点です。普通に音声を大量に処理するときには、先にある30秒ほどの音声の塊を見つけて順次処理していけばよいので問題にはならないのですが、ライブ音声インターフェースの場合、入力ストリームを見て大きな音声の塊を作成することはできず、また、塊自体も5秒から10秒より長くなることがめったにありません。このため、エンコーダーとデコーダーで無駄な「ゼロ埋め」処理が必要となり、結果が戻るまでの待ち時間が長くなってしまいます。Moonshineは最も重要な要件として「応答性」を挙げ、通常は200ミリ秒以下のレイテンシとして定義されるため、計算能力に余裕があるプラットフォームでもユーザー体験を損ない、制約の多いデバイスでは使い物にならなくなると述べています。


2点目は「Whisperは何もキャッシュしない」という点です。音声インターフェースの要件は「ユーザーが話しているときにフィードバックを表示する」、つまり話している間にSpeech to Textモデルを繰り返し呼び出すということです。しかし、Whisperは入力がほぼ一定であっても毎回ゼロから開始するので、以前処理したことのある音声に対しても冗長な処理が発生します。ここでも不必要な待ち時間が発生し、ユーザー体験を損ないます。


3点目は「Whisperは対応言語が多くない」という点です。Whisperは単一モデルで多くの言語を処理し、翻訳することができますが、82言語のうちWERが20％以下だったのは33言語にとどまります。また、制約の多いデバイスで実行した場合にWERが20％を切るのは5言語にまで減少します。クラウドAPI経由で利用できるバージョンだと精度が上がるようですが、オープンモデルとして利用することはできません。


このほかにも、Whisperエコシステム自体は育っているものの、エッジプラットフォーム全体でみるとインターフェイスや機能、最適化のレベルが異なるため、さまざまなデバイスで実行する必要があるアプリケーションの構築が不必要に難しくなっていることも指摘されています。


このため、Moonshineはライブ音声インターフェースのニーズを適切に満たす独自モデルファミリーの作成に乗り出したとのこと。


ライブラリーはPython、iOS、Android、macOS、Linux、Windows、Raspberry Pi、IoTデバイス、ウェアラブル端末で動作可能なので、プラットフォーム間の統合も容易です。


**GitHub - moonshine-ai/moonshine: Fast and accurate automatic speech recognition (ASR) for edge devices**

**https://github.com/moonshine-ai/moonshine?tab=readme-ov-file#quickstart**


高レベルAPIは文字起こしや話者識別、コマンド認識などの一般的なタスクを処理可能で、専門家ではない人でも音声アプリケーションを構築することができるとのこと。


対応言語は英語、スペイン語、中国語(北京語)、日本語、韓国語、ベトナム語、ウクライナ語、アラビア語など多岐に渡ります。


今後はモバイル展開のためのバイナリサイズ縮小や、より多くの言語やより多くのストリーミングモデル、話者識別の改善、軽量なドメインカスタマイズなどの実装を目指していくとのことです。


**・関連記事**

**FFmpeg 8.0「Huffman」リリース、文字起こしAI「Whisper」やVulkanベースのコーデックへの正式対応など過去最大級のメジャーアップデート - GIGAZINE**


**無料・オフラインで音声・動画を文字として書き起こす「Vibe」、OpenAIのWhisperを使ってWindows・macOS・Linuxで動作可能でYouTubeにも対応 - GIGAZINE**


**Appleの新しい文字起こしAPI「SpeechAnalyzer」がスピードテストでOpenAIのWhisperを圧倒 - GIGAZINE**


**MozillaがOpenAIのWhisperベースの高性能文字起こしAI「Whisperfile」を開発中 - GIGAZINE**


**文字起こしAI「Whisper」を誰でも簡単に使えるようにした超高精度文字起こしアプリ「writeout.ai」使い方まとめ、オープンソースでローカルでも動作OK - GIGAZINE**


**・関連コンテンツ**

in AI, Posted by logc_nt

You can read the machine translated English article **Moonshine Voice is a free, open-source A…**.

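Honestly, the related articles and related content sections are unnecessary, but the main content itself was extracted cleanly.

Documentation

Next, let's fetch a page from the Trafilatura documentation.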

uv run trafilatura --markdown --images -u "https://trafilatura.readthedocs.io/en/latest/crawls.html"

# Web crawling#

A tool aiming at the discovery of links by exploration and retrieval is commonly known as (web) crawler or spider. This process involves traversing the web to extract information and identify hyperlinks (URLs) for further exploration. A crawler keeps track of and permanently sorts the links seen in order to get to new leads. Essentially, a crawler is a sort of a virtual librarian which catalogues information.

Prominent operators of web crawlers include search engine companies, which use them to build their search indexes. Additional applications include web archiving, data mining, and text analytics. In linguistic research, they can be used to build web corpora.

Efficient techniques are essential to optimize resource utilization. Trafilatura supports focused crawling, adhering to politeness rules, and efficiently navigating through links. This page shows how to perform these tasks with Python and on the command-line.

## Design decisions#

### Intra vs. inter#

A necessary distinction has to be made between intra-domain and inter-domains crawling:

Focused crawling on website level: Finding sources within a website is relatively straightforward if it is not too rich in links or too convoluted.

Broad web crawling: Hopping across multiple websites can be challenging as it requires navigating diverse domains without accumulating irrelevant data or running into technical issues.


Trafilatura offers functions to support both approaches. In practice, intra-domain crawling is often the more feasible option, especially when paired with carefully curated sources.

Another viable alternative is leveraging existing data from external crawling projects. See information on finding sources for more details.

### Concept and operation#

Crawling starts with a seed list of URLs to visit. As these pages are downloaded, a parsing module extracts specific elements. The crawler identifies relevant hyperlinks present on the pages and adds them to the list of URLs to visit, called the frontier.

The crawl frontier is initially populated with the seed set. Visited pages are removed from the frontier. A filter is applied to determine whether the extracted links should be included, prioritizing navigation pages (such as archives or categories) to maximize link gathering in few iterations. The resulting links are then added to the frontier.

Hint

See also the documentation page Compendium: Web texts in linguistics and humanities for more details.

### Characteristics#

The spider module implements politeness rules as defined by the Robots Exclusion Standard, where applicable.

Duplicate removal is also implemented, which involves both URL- and text-level analysis. This allows the crawler to detect and avoid revisiting previously crawled URLs or web pages with identical content.

It is safe to crawl a fairly high number of websites and pages per host, bounding factors are time (waiting between requests on the same host), bandwidth (for concurrent downloads), and RAM (above millions of URLs to track).

## With Python#

### Focused crawler#

The `focused_crawler()`

function integrates all necessary components and can be customized using various arguments. To use it, you will need to import the corresponding module and call the function with a URL to start from (`homepage`

parameter). The function also accepts optional parameters:

`max_seen_urls`

: the maximum number of pages to visit (default: 10)`max_known_urls`

: the maximum number of pages to “know” about (default: 100000)`todo`

: provide a previously generated list of pages to visit (i.e. a crawl frontier)`known_links`

: provide a list of previously known pages`lang`

: try to target links according to language heuristics (two-letter code)

The following example demonstrates how to set up a focused crawler to extract internal links from a given website:

```

>>> from trafilatura.spider import focused_crawler
>>>
# perform the first iteration (will not work with this website, there are no internal links)
>>>
>>> to_visit, known_links = focused_crawler("<https://example.org>", max_seen_urls=1)

```

### Step by step#

The function returns two values, a snapshot of the current crawling state. Since the collected links can be downloaded and processed at a later time, it is recommended to progress in a step-by-step manner to save and examine data between runs.

The `to_visit`

variable keeps track of what is ahead and the `known_links`

variable ensures that the same pages are not visited twice. As this requirement can vary depending on the use case (e.g. checking new pages every day on a homepage) these variables are optional. Other parameters include `config`

(see settings file) and `rules`

(politeness rules, defaults to the ones provided by the website or safe values).

```

# perform another iteration using previously collected information
>>>
>>> to_visit, known_links = focused_crawler("<https://example.org>", max_seen_urls=10, max_known_urls=100000, todo=to_visit, known_links=known_links)

```

In this example, the crawler stops after seeing a maximum of 10 URLs or registering a total of 100,000 URLs on the website, whichever comes first. Setting both parameters to high values can result in a significant increase in processing time.

You can also use a custom configuration and pass politeness rules to the crawler. For more information see the documentation of the function.

You can determine the course of a crawl by checking if there are still navigation pages to visit using the `is_still_navigation()`

function:

```

>>> from trafilatura.spider import is_still_navigation
>>> is_still_navigation(to_visit)
>>>
# returns True or False

```

For more info please refer to the core functions page.

## On the command-line#

Two options are available on the command-line:

`--crawl`

: crawl a fixed number of pages within the website`--explore`

: combination of sitemap and crawl (uses sitemaps if possible)

On the CLI the crawler automatically works its way through a website, stopping at a maximum of 30 page visits or exhaustion of the total number of pages on the website, whichever comes first.

```

$ trafilatura --crawl "<https://www.example.org>" > links.txt

```

It can also crawl websites in parallel by reading a list of target sites from a list using the `-i`

/`--input-file`

option.

Note

The `--list`

option does not apply here. Unlike with the `--sitemap`

or `--feed`

options, the URLs are simply returned as a list instead of being retrieved and processed. This allows for examination of the collected URLs prior to further downloads. For more information on refining and filtering URL collections, see the underlying courlan package.

## Further reading#

Boldi, P., Codenotti, B., Santini, M., & Vigna, S. (2004). Ubicrawler: A scalable fully distributed web crawler. Software: Practice and Experience, 34(8), 711-726.

Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer networks and ISDN systems, 30(1-7), 161-172.

Cho, J. (2001). Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data, PhD dissertation, Dept. of Computer Science, Stanford University.

Hirai, J., Raghavan, S., Garcia-Molina, H., & Paepcke, A. (2000). WebBase: A repository of web pages. Computer Networks, 33(1-6), 277-293.

Olston, C., & Najork, M. (2010). Web crawling. Now Publishers Inc.

Shkapenyuk, V., & Suel, T. (2002). Design and implementation of a high-performance distributed web crawler. In Proceedings 18th International Conference on Data Engineering (pp. 357-368). IEEE.

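Conclusion

I did not do a detailed comparison with firecrawl, but in my experience trafilatura is far better at retrieving the main content. It also installs as a Python package and runs from the command line, so it is very easy to use.

A wrapper that automatically downloads Markdown for a list of URLs would be very handy.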
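
As a rough starting point, such a wrapper might look like the sketch below. It assumes trafilatura's Python API (`fetch_url`, and `extract` with `output_format="markdown"`, which needs a reasonably recent trafilatura release). The helper names `url_to_filename` and `save_markdown`, the file naming scheme, and the injectable `to_markdown` argument are hypothetical choices for illustration, not part of trafilatura.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Build a filesystem-safe .md file name from a URL (hypothetical naming scheme)."""
    parsed = urlparse(url)
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + parsed.path).strip("-")
    # A short hash keeps names unique when URLs differ only in the query string.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug[:80]}-{digest}.md"


def _trafilatura_to_markdown(url: str):
    """Download one page and return its main content as Markdown, or None."""
    # Imported lazily so the other helpers work without trafilatura installed.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download failed
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded, output_format="markdown")


def save_markdown(urls, out_dir, to_markdown=_trafilatura_to_markdown):
    """Write one Markdown file per URL; return {url: Path, or None on failure}."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    results = {}
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the URL list
        text = to_markdown(url)
        if not text:
            results[url] = None  # download or extraction failed
            continue
        target = out_path / url_to_filename(url)
        target.write_text(text, encoding="utf-8")
        results[url] = target
    return results
```

With a text file containing one URL per line, something like `save_markdown(open("urls.txt"), "out")` would write the extracted pages into `out/`. Passing the extraction step in as `to_markdown` keeps the file handling separate from the network access, so it can be swapped out or stubbed.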