An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements

Fahrudin, Tresna Maulana; Funabiki, Nobuo; Brata, Komang Candra; Naing, Inzali; Aung, Soe Thandar; Muhaimin, Amri; Prasetya, Dwi Arman

doi:10.3390/fi17050195

Permalink : https://ousar.lib.okayama-u.ac.jp/69414

ID	69414
Full-text URL	fulltext.pdf (11.6 MB)
Authors	Fahrudin, Tresna Maulana (Department of Information and Communication Systems, Okayama University); Funabiki, Nobuo (Department of Information and Communication Systems, Okayama University); Brata, Komang Candra (Department of Information and Communication Systems, Okayama University); Naing, Inzali (Department of Information and Communication Systems, Okayama University); Aung, Soe Thandar (Department of Information and Communication Systems, Okayama University); Muhaimin, Amri (Department of Data Science, Universitas Pembangunan Nasional Veteran Jawa Timur); Prasetya, Dwi Arman (Department of Data Science, Universitas Pembangunan Nasional Veteran Jawa Timur)
Abstract	Nowadays, accessibility to academic papers has been significantly improved by electronic publications on the internet, where open access has become common. At the same time, this has increased the workload of literature surveys for researchers, who usually download PDF files and check their contents manually. To address this drawback, we have proposed a reference paper collection system using web scraping technology and natural language models. However, our previous system often finds a limited number of relevant reference papers after taking a long time, since it relies on one paper search website and runs on a single thread on a multi-core CPU. In this paper, we present an improved reference paper collection system with three enhancements to address these limitations: (1) integrating the APIs from multiple paper search websites, namely, the bulk search endpoint in the Semantic Scholar API, the article search endpoint in the DOAJ API, and the search and fetch endpoint in the PubMed API to retrieve article metadata, (2) running the program on multiple threads on a multi-core CPU, and (3) implementing Dynamic URL Redirection, Regex-based URL Parsing, and HTML Scraping with URL Extraction for fast checking of PDF file accessibility, along with sentence embedding to assess relevance based on semantic similarity. For evaluations, we compare the number of obtained reference papers and the response time between the proposal, our previous work, and common literature search tools in five reference paper queries. The results show that the proposal increases the number of relevant reference papers by 64.38% and reduces the time by 59.78% on average compared to our previous work, while outperforming common literature search tools in the number of reference papers obtained. Thus, the effectiveness of the proposed system has been demonstrated in our experiments.
Keywords	reference paper collection; multiple API integration; PDF accessibility; open access; multiple threads
Publication date	2025-04-28
Publication title	Future Internet
Volume	17
Issue	5
Publisher	MDPI AG
Start page	195
ISSN	1999-5903
Resource type	Journal article
Language	English
OAI-PMH Set	Okayama University
Copyright holder	© 2025 by the authors.
Paper version	publisher
DOI	10.3390/fi17050195
Web of Science KeyUT	001497484600001
Related URL	isVersionOf https://doi.org/10.3390/fi17050195
License	https://creativecommons.org/licenses/by/4.0/
Citation	Fahrudin, T.M.; Funabiki, N.; Brata, K.C.; Naing, I.; Aung, S.T.; Muhaimin, A.; Prasetya, D.A. An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements. Future Internet 2025, 17, 195. https://doi.org/10.3390/fi17050195
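
Note: the abstract names multi-threaded execution and regex-based URL parsing as two of the enhancements. The following is a minimal sketch of how those two ideas might be combined, not the authors' implementation. The pattern `PDF_URL_PATTERN`, the helper names `looks_like_pdf` and `check_urls`, and the example URLs are all hypothetical. A real checker would issue HTTP requests inside the worker function, which is the I/O-bound case where threads are most likely to help; here the worker only inspects the URL string so that the sketch runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule (an assumption, not taken from the paper): treat a URL as
# a likely PDF link if its path ends in ".pdf" or has a "/pdf" path segment.
PDF_URL_PATTERN = re.compile(r"(\.pdf$|/pdf(/|$))", re.IGNORECASE)


def looks_like_pdf(url: str) -> bool:
    """Return True if the URL path matches the illustrative PDF pattern."""
    # Drop the query string and fragment so only the path is inspected.
    path = url.split("?", 1)[0].split("#", 1)[0]
    return bool(PDF_URL_PATTERN.search(path))


def check_urls(urls, max_workers=4):
    """Classify URLs on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(looks_like_pdf, urls))


if __name__ == "__main__":
    candidates = [
        "https://example.org/paper.pdf?download=1",
        "https://example.org/article/123/pdf",
        "https://example.org/abstract/123",
    ]
    for url, is_pdf in zip(candidates, check_urls(candidates)):
        print(url, is_pdf)
```

Because `pool.map` returns results in input order, each verdict can be paired with its URL without extra bookkeeping.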