Storing, combining and analysing turkey experimental data in the Big Data era

Animal. 2020 Jun;:1-7. S175173112000155X. doi: 10.1017/S175173112000155X. Epub 2020-06-22.

Storing, combining and analysing turkey experimental data in the Big Data era.

D Schokker
I N Athanasiadis
B Visser
R F Veerkamp
C Kamphuis

PMID: 32624081 DOI: 10.1017/S175173112000155X.

Abstract

With the increasing availability of large amounts of data in the livestock domain, we face the challenge of storing, combining and analysing these data efficiently. In this study, we explored the use of a data lake for storing and analysing data to improve scalability and interoperability. Data originated from a 2-day animal experiment in which the gait score of approximately 200 turkeys was determined through visual inspection by an expert. Additionally, inertial measurement units (IMUs), a 3D-video camera and a force plate (FP) were installed to explore the effectiveness of these sensors in automating the visual gait scoring. We deployed a data lake using the IMU and FP data of a single day of that animal experiment. This encompasses data from 84 turkeys, which we preprocessed by performing an 'extract, transform and load' (ETL) procedure. To test the scalability of the ETL procedure, we simulated increasing volumes of the available data from this animal experiment and computed the 'wall time' (elapsed real time) for converting FP data into comma-separated files and storing these files. With a simulated data set of 30 000 turkeys, the wall time fell from 1 h to less than 15 min when 12 cores were used instead of 1 core. This demonstrated the ETL procedure to be scalable. Subsequently, a machine learning (ML) pipeline was developed to test the potential of a data lake to automatically distinguish between two classes, that is, very bad gait scores v. other scores. In conclusion, we have set up a dedicated customized data lake, loaded data and developed a prediction model via the creation of an ML pipeline. A data lake appears to be a useful tool to face the challenge of storing, combining and analysing increasing volumes of data of varying nature in an effective manner.
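
The wall-time scalability test described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the tools used, so the example assumes plain Python with the standard library, and the force-plate readings are randomly generated placeholders. All function names (`simulate_fp_readings`, `convert_bird`, `timed_etl`, `speedup_and_efficiency`) and the CSV column names are hypothetical.

```python
import csv
import random
import time
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def simulate_fp_readings(bird_id, n_samples=1000):
    """Return placeholder force-plate samples (time in s, force in N) for one bird."""
    rng = random.Random(bird_id)  # seeded per bird so runs are reproducible
    return [(i / 100.0, rng.uniform(0.0, 150.0)) for i in range(n_samples)]


def convert_bird(job):
    """Write one bird's simulated readings to a comma-separated file; return its path."""
    bird_id, out_dir = job
    path = Path(out_dir) / f"fp_bird_{bird_id:05d}.csv"
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["time_s", "force_n"])  # assumed column names
        writer.writerows(simulate_fp_readings(bird_id))
    return str(path)


def timed_etl(n_birds, n_workers, out_dir):
    """Convert n_birds files using n_workers processes; return wall time in seconds."""
    jobs = [(bird_id, out_dir) for bird_id in range(n_birds)]
    start = time.perf_counter()  # wall-clock (elapsed real) time, not CPU time
    if n_workers == 1:
        for job in jobs:
            convert_bird(job)
    else:
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            # list() forces all conversions to finish before the timer stops
            list(pool.map(convert_bird, jobs, chunksize=16))
    return time.perf_counter() - start


def speedup_and_efficiency(serial_time, parallel_time, n_workers):
    """Return (speedup, parallel efficiency) for two wall times in the same unit."""
    speedup = serial_time / parallel_time
    return speedup, speedup / n_workers
```

To compare core counts, one could call `timed_etl(200, 1, out_dir)` and `timed_etl(200, 4, out_dir)` from a script guarded by `if __name__ == "__main__":` and pass the two results to `speedup_and_efficiency`. Applied to the figures reported in the abstract, `speedup_and_efficiency(60.0, 15.0, 12)` gives a speedup of 4 and an efficiency of about 0.33; because the reported 12-core wall time was less than 15 min, both values are lower bounds.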