A coral-reef approach to extract information from HTML tables

This article presents Coraline, a new table-understanding proposal. Its novelty lies in a coral-reef optimisation algorithm that addresses the problem of feature selection in synchrony with a clustering technique and some custom heuristics that help extract information in a totally unsupervised manner. Our experimental analysis was performed on a large collection of tables with a variety of layouts, encoding problems, and formatting alternatives. Coraline achieved an F1 score as high as 0.90 and took 7.07 CPU seconds per table, which improves on the best supervised proposal by 6.67% regarding effectiveness and 40.54% regarding efficiency; it also improves on the best unsupervised proposal by 11.11% regarding effectiveness while remaining very competitive regarding efficiency. © 2021 Elsevier B.V. All rights reserved.

HTML tables; Information extraction; Coral-reef optimisation; Feature selection; Clustering; WEB DATA EXTRACTION; OPTIMIZATION; ALGORITHM

Jimenez, Patricia; Corchuelo, Rafael; Roldan, Juan C.

Univ Seville

2022

Applied Soft Computing

EI, SCI
ISSN:1568-4946
Year, volume (issue): 2022, 115