Research on Big Data Mining and Spatial Clustering Model for Marine Environmental Climate News
Zhong Ming (钟鸣)¹, Zhang Jianhui (张建辉)¹, Bi Wenlu (毕文璐)¹, Li Jinrong (李金蓉)¹
Author information
1. National Marine Data and Information Service, Tianjin 300171, China
Abstract
Taking the GDELT (Global Database of Events, Language, and Tone) database as an example, this paper discusses how to crawl relevant news documents through the data source paths that GDELT provides. An improved Aho-Corasick (AC) automaton performs multi-pattern keyword matching to complete the preliminary data cleaning. The number of topics in the filtered documents is then evaluated, and an LDA (latent Dirichlet allocation) model is used to classify the documents by topic and extract keywords. Based on the classification results, a spatial clustering model is built for news data on marine environment and climate topics and for related indicators. The result is an analysis model that covers crawling, cleaning, topic mining, spatial clustering, and visual presentation of massive document data.
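The keyword-matching step in the cleaning stage can be illustrated with a minimal sketch of a standard Aho-Corasick automaton. This is the textbook algorithm, not the improved variant the paper refers to, whose details the abstract does not give. The class name `AhoCorasick`, the helper `filter_documents`, the `min_hits` threshold, and the sample keywords are all hypothetical and chosen only for illustration.

```python
from collections import deque


class AhoCorasick:
    """Textbook Aho-Corasick automaton for multi-pattern keyword matching."""

    def __init__(self, keywords):
        # Per-node data: outgoing transitions, failure link, matched keywords.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for word in keywords:
            self._insert(word.lower())
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit keywords that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find(self, text):
        """Return the set of keywords that occur in text (case-insensitive)."""
        node = 0
        hits = set()
        for ch in text.lower():
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            hits.update(self.out[node])
        return hits


def filter_documents(docs, keywords, min_hits=1):
    """Keep documents containing at least min_hits distinct keywords (hypothetical rule)."""
    automaton = AhoCorasick(keywords)
    return [doc for doc in docs if len(automaton.find(doc)) >= min_hits]
```

A call such as `filter_documents(news_texts, ["ocean", "sea level", "climate"])` scans each document once regardless of how many keywords are supplied, which is the property that makes this family of automata attractive for filtering large crawls before topic modeling.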