Hopper: a mathematically optimal algorithm for sketching biological data

Bioinformatics. 2020 Jul;36(Supplement_1):i236-i241. 5870483. doi: 10.1093/bioinformatics/btaa408.

Hopper: a mathematically optimal algorithm for sketching biological data.

Benjamin DeMeo
Bonnie Berger

PMID: 32657375 PMCID: PMC7355272. DOI: 10.1093/bioinformatics/btaa408.

Abstract

MOTIVATION: Single-cell RNA-sequencing has grown massively in scale since its inception, presenting substantial analytic and computational challenges. Even simple downstream analyses, such as dimensionality reduction and clustering, require days of runtime and hundreds of gigabytes of memory for today's largest datasets. In addition, current methods often favor common cell types, and miss salient biological features captured by small cell populations.

RESULTS: Here we present Hopper, a single-cell toolkit that both speeds up the analysis of single-cell datasets and highlights their transcriptional diversity by intelligent subsampling, or sketching. Hopper realizes the optimal polynomial-time approximation of the Hausdorff distance between the full and downsampled dataset, ensuring that each cell is well-represented by some cell in the sample. Unlike prior sketching methods, Hopper adds points iteratively and allows for additional sampling from regions of interest, enabling fast and targeted multi-resolution analyses. In a dataset of over 1.3 million mouse brain cells, Hopper detects a cluster of just 64 macrophages expressing inflammatory genes (0.004% of the full dataset) from a Hopper sketch containing just 5000 cells, and several other small but biologically interesting immune cell populations invisible to analysis of the full data. On an even larger dataset consisting of ∼2 million developing mouse organ cells, we show Hopper's even representation of important cell types in small sketches, in contrast with prior sketching methods. We also introduce Treehopper, which uses spatial partitioning to speed up Hopper by orders of magnitude with minimal loss in performance. By condensing transcriptional information encoded in large datasets, Hopper and Treehopper grant the individual user with a laptop the analytic capabilities of a large consortium.
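The iterative sampling idea in the results can be illustrated with farthest-first traversal, the classic greedy heuristic for the k-center problem, which is known to bring the Hausdorff distance between a size-k subsample and the full point set within a factor of 2 of the best possible. The following is a minimal sketch under that assumption, not the code from the Hopper repository: the function name `farthest_first_sketch`, the two-dimensional toy "cells", and all parameters are hypothetical and chosen only for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def farthest_first_sketch(
    points: List[Point], k: int, start: int = 0
) -> Tuple[List[int], float]:
    """Greedy farthest-first traversal (illustrative, not Hopper's own code).

    Returns the indices of the k sampled points and the Hausdorff distance
    of the sketch: the largest distance from any point to its nearest
    sampled point.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # nearest[i] = distance from points[i] to the closest chosen point so far
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Add the point that the current sketch represents worst.
        far = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(far)
        # Update each point's distance to its nearest sampled point.
        for i, p in enumerate(points):
            d = math.dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
    return chosen, max(nearest)


# Toy data (assumption): 1000 "common" cells near the origin and
# 5 "rare" cells far away, standing in for a small cell population.
rng = random.Random(0)
common = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
rare = [(50 + rng.gauss(0, 1), 50 + rng.gauss(0, 1)) for _ in range(5)]
cells = common + rare

sketch_idx, hausdorff = farthest_first_sketch(cells, k=10)
rare_in_sketch = [i for i in sketch_idx if i >= len(common)]
print(f"sketch size: {len(sketch_idx)}, rare cells sampled: {len(rare_in_sketch)}")
print(f"Hausdorff distance of the sketch: {hausdorff:.3f}")
```

Because each step adds the point farthest from the current sketch, the rare cluster is sampled as soon as it becomes the worst-represented region, whereas uniform subsampling of 10 out of 1005 points would usually miss it. The traversal is also incremental: with the same starting point, a sketch of size k is a prefix of any larger sketch, so points can be added later without recomputing the earlier ones.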

AVAILABILITY AND IMPLEMENTATION: The code for Hopper is available at https://github.com/bendemeo/hopper. In addition, we have provided sketches of many of the largest single-cell datasets, available at http://hopper.csail.mit.edu.