Recommendation

Harnessing the full potential of iNaturalist and other databases

Matthias Foellmer based on reviews by Clive Hambler and Catherine Scott

A recommendation of:

A pipeline for assessing the quality of images and metadata from crowd-sourced databases.

Jackie Billotte (2022), bioRxiv, 2022.04.29.490112, ver 5 peer reviewed and recommended by Peer Community In Zoology. https://doi.org/10.1101/2022.04.29.490112

Now published in Peer Community Journal.

Abstract

A pipeline for assessing the quality of images and metadata from crowd-sourced databases.

Crowd-sourced biodiversity databases provide easy access to data and images for ecological education and research. One concern with using publicly sourced databases, however, is the quality of their images, taxonomic descriptions, and geographical metadata. The method presented in this paper attempts to address this concern using a suite of pipelines to evaluate taxonomic consistency, how well geo-tagging fits known distributions, and the image quality of crowd-sourced data acquired from iNaturalist, a crowd-sourced biodiversity database. Additionally, it allows researchers who use these datasets to report a quantifiable assessment of taxonomic consistency. The pipeline allows users to analyze multiple images from iNaturalist and their associated metadata to determine the level of taxonomic identification (family, genus, species) for each occurrence; whether the taxonomy label for an image matches the accepted nesting of families, genera, and species; and whether geo-tags match the distribution of the taxon, using occurrence data from the Global Biodiversity Information Facility (GBIF) as a reference. Additionally, image quality is assessed using BRISQUE, an algorithm that allows for image quality evaluation without a reference photo. Entries from the order Araneae (spiders) are used as a case study. Overall, the results suggest that iNaturalist can provide large metadata and image sets for research. Given the inevitability of some low-quality observations, this pipeline provides a valuable resource for researchers and educators to evaluate the quality of iNaturalist and other crowd-sourced data.

biodiversity, iNaturalist, GBIF, metadata, pipeline, database, community science


Submission: posted 03 May 2022
Recommendation: posted 11 November 2022, validated 30 November 2022

Cite this recommendation as:
Foellmer, M. (2022) Harnessing the full potential of iNaturalist and other databases. Peer Community in Zoology, 100017. https://doi.org/10.24072/pci.zool.100017

Recommendation

The popularity of iNaturalist and other online biodiversity databases to which the general public and specialists alike contribute observations has skyrocketed in recent years (Dance 2022). The AI-based algorithms (computer vision) that provide the first identification of an organism in an uploaded photograph have become very sophisticated, often suggesting initial identifications down to species level with a surprisingly high degree of accuracy. The initial identifications are then confirmed or improved by feedback from the community, which works particularly well for organismal groups to which many active community members contribute, such as birds. Hence, providing initial observations and identifying the observations of others, as well as browsing the recorded biodiversity for given locales or the range of occurrences of individual taxa, has become a meaningful and satisfying experience for the interested naturalist. Furthermore, several research studies relying on observations uploaded to iNaturalist have now been published (Szentivanyi and Vincze 2022). However, using the enormous amount of natural history data available on iNaturalist in a systematic way has remained challenging, since this requires not only retrieving numerous observations from the database (hundreds or even thousands), but also some level of transparent quality control.

Billotte (2022) provides a protocol and R scripts for the quality assessment of observations downloaded from iNaturalist, allowing an efficient and reproducible stepwise approach to preparing a high-quality dataset for further analysis. First, observations with their associated metadata are downloaded from iNaturalist, along with the corresponding entries from the Global Biodiversity Information Facility (GBIF). In addition, a taxonomic reference list is obtained (these are available online for many taxa), which is used to assess the taxonomic consistency of the dataset. Second, the geo-tagging is assessed by comparing the iNaturalist and GBIF metadata. Lastly, the image quality is assessed using pyBRISQUE. The approach is illustrated using spiders (Araneae) as an example. Spiders are a very diverse taxon and an excellent taxonomic reference list is available (World Spider Catalog 2022). However, spiders are not well known to most non-specialists, and it is not easy to take good pictures of spiders without professional equipment. Therefore, the ability of iNaturalist’s computer vision to provide identifications remains limited to date, and the community of specialists active on iNaturalist is comparatively small. Hence, spiders are a good taxon to demonstrate how the pipeline results in a quality-controlled dataset based on crowd-sourced data. Importantly, the software employed is free to use, although inevitably the initial learning curve for using R scripts can be steep, depending on prior expertise with R/RStudio. Furthermore, the approach can be applied to databases other than iNaturalist.
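To make the first two checks concrete, the following is a minimal Python sketch of a taxonomic nesting check and a geo-tag check. It is not Billotte's R code. The `Observation` fields, the two-entry `REFERENCE` table, the function names, and the padded bounding-box rule are all assumptions made for illustration, and the reference coordinates are made up. A real workflow would build the reference table from a catalog such as the World Spider Catalog and would probably use a more careful range comparison (a plain bounding box, for instance, misbehaves for ranges that cross the antimeridian).

```python
from dataclasses import dataclass

# Hypothetical reference list: species -> (genus, family). In practice this
# would be built from a taxonomic catalog, not typed in by hand.
REFERENCE = {
    "Araneus diadematus": ("Araneus", "Araneidae"),
    "Latrodectus hesperus": ("Latrodectus", "Theridiidae"),
}


@dataclass
class Observation:
    family: str
    genus: str
    species: str
    latitude: float
    longitude: float


def taxonomy_is_consistent(obs, reference=REFERENCE):
    """True if the species is in the reference list and the recorded genus
    and family match the nesting given there."""
    expected = reference.get(obs.species)
    return expected is not None and expected == (obs.genus, obs.family)


def within_reference_range(obs, reference_points, margin_deg=1.0):
    """True if the observation lies inside the bounding box of the reference
    occurrences, padded by margin_deg degrees (a deliberate simplification)."""
    if not reference_points:
        return False
    lats = [lat for lat, _ in reference_points]
    lons = [lon for _, lon in reference_points]
    return (min(lats) - margin_deg <= obs.latitude <= max(lats) + margin_deg
            and min(lons) - margin_deg <= obs.longitude <= max(lons) + margin_deg)


# Made-up reference coordinates standing in for GBIF occurrence records.
reference_points = [(51.5, -0.1), (48.9, 2.4), (40.4, -3.7)]
obs = Observation("Araneidae", "Araneus", "Araneus diadematus", 40.7, -74.0)

print(taxonomy_is_consistent(obs))                    # True
print(within_reference_range(obs, reference_points))  # False
```

An observation that fails the range check is only flagged, not rejected: as one of the reviewers notes below, a mismatch can reflect either a mislabeled observation or a genuine range expansion.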

In summary, Billotte's (2022) pipeline allows researchers to use the wealth of observations on iNaturalist and other databases to produce large, high-quality metadata and image datasets in a reproducible way. This should pave the way for more studies, which could include, for example, the assessment of range expansions of invasive species or the evaluation of the presence of endangered species, potentially supporting conservation efforts.

References

Billotte J (2022) A pipeline for assessing the quality of images and metadata from crowd-sourced databases. bioRxiv, 2022.04.29.490112, ver 5 peer reviewed and recommended by Peer Community In Zoology. https://doi.org/10.1101/2022.04.29.490112

Dance A (2022) Community science draws on the power of the crowd. Nature, 609, 641–643. https://doi.org/10.1038/d41586-022-02921-3

Szentivanyi T, Vincze O (2022) Tracking wildlife diseases using community science: an example through toad myiasis. European Journal of Wildlife Research, 68, 74. https://doi.org/10.1007/s10344-022-01623-5

World Spider Catalog (2022). World Spider Catalog. Version 23.5. Natural History Museum Bern, online at http://wsc.nmbe.ch. https://doi.org/10.24436/2

PDF recommendation

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
No indication

Reviews

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2022.04.29.490112

Version of the preprint: 2

Author's Reply, 18 Oct 2022

Download author's reply https://doi.org/10.24072/pci.zool.100128.ar2

Decision by Matthias Foellmer, posted 05 Sep 2022

Dear Ms. Billotte,

Thank you for addressing the reviewers’ comments in such a thorough manner. I have only a few comments left regarding the writing/presentation:

L 9/10: avoid repetitions

L42: delete comma after “image-based”

L63/64: switch “the of” to “of the”

Figure 2: The labeling seems incomplete. In the legend, you refer to Section 1, 2, and 3. Which sections are these?

Figure 3: In the legend, last sentence, it should be “observations” (plural).

Results: please check the numbers one more time, or at least clarify. On L171, you state that you found 156,842 downloadable observations and on L176, you say that 49.91% were identified to at least family level. But on L181, you state that 156,842 out of 158,129 downloadable observations had a family-level identification. On L185, you refer to 425,950 “records”. What is a record in this context, i.e. how does it differ from an observation?

L223-225: use “research grade” throughout.

L234: delete comma after “quantifiable”

Zizka et al. 2019 is not in the reference list.

Kind regards,

Matthias Foellmer, NYC, 5 September 2022

https://doi.org/10.24072/pci.zool.100128.d2

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2022.04.29.490112

Version of the preprint: 1

Author's Reply, 18 Aug 2022

Download author's reply https://doi.org/10.24072/pci.zool.100128.ar1

Decision by Matthias Foellmer, posted 22 Jun 2022

Dear Jackie Billotte,

In this manuscript you present a protocol to efficiently evaluate the quality of images and associated metadata of any pre-specified taxon using self-written and co-opted R and MATLAB scripts. Importantly, the usefulness of the approach extends to e.g. assessing the range expansions of invasive species or evaluating the presence of endangered species, potentially supporting conservation efforts. Given the rapidly increasing popularity of iNaturalist and similar databases to which the general public contributes, this paper can make a very valuable and timely contribution. The reviewers agree with this view and provide thoughtful suggestions for improving the clarity and facilitating the use of the various workflow steps. It would be fantastic if you could make BRISQUE work in Python, since, as one of the reviewers points out, few readers will probably have access to MATLAB, which is expensive.

In addition to the reviewers’ comments, I have the following suggestions:

L20: “… but lower in observations…” – wouldn’t “for” be more appropriate than “in”?

Data acquisition:

- You state that “For the Araneae case study, I searched for and downloaded observations for each family under the order Araneae on iNaturalist on July 21, 2021 (‘iNaturalist’, 2014). I then searched for and downloaded observations classified only to the order level.” Please explain why you employed this strategy. One would think that a single query for Araneae should give all relevant results, with all observations determined to the various taxonomic levels. After all, downloading data for 100+ families one-by-one seems an arduous endeavor I would want to avoid.

- Datasets to be downloaded from iNat can easily be very large without narrowing down the search criteria. Please detail your search strategy for at least one example. Specify the settings in the filter and the Export Observations page, so that the reader can reproduce your search results (see also the reviewers’ comments). On lines 87ff you only state the minimum requirements with respect to the fields to be included.

- When I tried to run your R code for obtaining the data directly from iNat for taxon_name = "Araneidae", I got the error message “Error in get_inat_obs(query = NULL, taxon_name = "Araneidae", taxon_id = NULL, : Your search returned too many results, please consider breaking it up into smaller chunks by year or month.” So simply searching for a given family doesn’t work, highlighting the need for a more detailed description of your search strategy.

- Your site Observation_Database_Assesment on GitHub.com currently (when I checked) only has the basic R and MATLAB code posted, but no other files. Please add example searches and data sets.
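The "too many results" error quoted above itself suggests splitting a query by year or month. The chunking logic can be sketched in Python as follows. The names `month_chunks`, `download_in_chunks`, and `fake_fetch` are hypothetical, and `fetch` stands in for whatever client call actually retrieves the observations for one taxon, year, and month; no real API is called here.

```python
from datetime import date


def month_chunks(start, end):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def download_in_chunks(fetch, taxon_name, start, end):
    """Call fetch(taxon_name, year, month) once per month and pool the results.

    `fetch` is a hypothetical callable supplied by the user; it is expected
    to return a list of records for that taxon and month.
    """
    records = []
    for year, month in month_chunks(start, end):
        records.extend(fetch(taxon_name, year, month))
    return records


def fake_fetch(taxon_name, year, month):
    """Stand-in for a real download: returns one placeholder record."""
    return [f"{taxon_name}-{year}-{month:02d}"]


records = download_in_chunks(fake_fetch, "Araneidae",
                             date(2020, 11, 1), date(2021, 2, 28))
print(len(records))  # 4
```

Recording the date range and chunk size alongside the downloaded data would also let a reader reproduce the search, which is the point of the request above.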

R and MATLAB code: please make sure you provide sufficient annotation so that all steps and their implementation can easily be understood even by the not-so-proficient coder.

I hope you find the reviewers’ and my comments helpful. I’m looking forward to reading your revision.

Matthias Foellmer, NYC, 22-Jun-2022

https://doi.org/10.24072/pci.zool.100128.d1

Reviewed by Catherine Scott, 06 Jun 2022

This is a valuable and timely contribution. As an arachnologist interested in using iNaturalist data, I think it provides some very useful tools and guidelines for processing and using these data.

Major comments:

Unlike R, which is freely available, MATLAB is a paid software so it is not accessible to all. If possible, it would be preferable to have a script for the image analysis in a freely available software. It seems that BRISQUE can be implemented in Python: https://github.com/bukalapak/pybrisque. I do really like the suggestion that iNaturalist automatically score image quality--this would make filtering out useful observations much easier!

While not critical for this paper, which is meant to describe methods, it would be really nice to have an example of the utility of some of the methods, perhaps for a particular family (one of the smaller ones). Going through each of the steps and making the specific small dataset and code available for readers to reproduce the analyses to familiarize themselves with the pipeline, knowing the expected results, would be very valuable. It would also be good to show an example where the GBIF and iNAT ranges do not match, and a look at whether it's because of a mislabeled observation or a true range expansion.

Minor comments:

Since the data were non-normal, it might be more appropriate and informative to report medians and ranges rather than means and SE.
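A small Python sketch with made-up scores shows why this matters. The values and the `summarize` helper are illustrative only and are not taken from the manuscript.

```python
import statistics


def summarize(values):
    """Return mean and standard error alongside median and range."""
    n = len(values)
    return {
        "mean": statistics.mean(values),
        "se": statistics.stdev(values) / n ** 0.5,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up, right-skewed scores: one extreme value pulls the mean upward.
scores = [20, 22, 23, 25, 90]
summary = summarize(scores)
print(summary["mean"], summary["median"])   # 36 23
print(summary["min"], summary["max"])       # 20 90
```

Here the single outlier moves the mean well above four of the five values, while the median and range describe the sample more faithfully.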

Figure 3 caption is cut off

line 162: Araneidae must be a typo--presumably this should be one of the smaller families that start with A

line 168: "I found 158,129 of the 156,842 downloadable observations" check numbers, have they been reversed?

line 173: it would be helpful to have some explanation of what it means for an observation to be accurate or precise.

line 206: it is not quite correct that "requires that an observation reach a threshold of three votes from users to confirm an identification" as research-grade. Instead, iNaturalist states that "Observations become 'Research Grade' when the community agrees on species-level ID or lower, i.e. when more than 2/3 of identifiers agree on a taxon." In practice this means that an observation can become research grade after exactly two identifiers agree on an ID.
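The arithmetic behind the "more than 2/3" rule can be checked with a short Python sketch. This is a simplified reading of the quoted sentence, not iNaturalist's actual algorithm: the function name and the minimum of two identifiers are assumptions based on the comment above, and the species-level requirement and any other eligibility criteria are ignored.

```python
from collections import Counter
from fractions import Fraction


def reaches_research_grade(identifications, min_identifiers=2):
    """True if more than 2/3 of the identifications agree on one taxon.

    Simplified illustration only: assumes one identification per identifier
    and at least `min_identifiers` identifications in total.
    """
    if len(identifications) < min_identifiers:
        return False
    _, votes = Counter(identifications).most_common(1)[0]
    # Fractions keep the comparison exact: 2/3 is not "more than 2/3".
    return Fraction(votes, len(identifications)) > Fraction(2, 3)


print(reaches_research_grade(["A", "A"]))            # True
print(reaches_research_grade(["A", "A", "B"]))       # False
print(reaches_research_grade(["A", "A", "A", "B"]))  # True
```

Two agreeing identifiers give 2/2, which exceeds 2/3, so no third vote is needed. Two against one gives exactly 2/3, which does not.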

Note: I did not have a chance to try to run any of the code myself.

https://doi.org/10.24072/pci.zool.100128.rev11

Reviewed by Clive Hambler, 01 Jun 2022

Download the review https://doi.org/10.24072/pci.zool.100128.rev12