Identification and statistical analysis methods of personal information disclosure in open government data

CHEN Haisu (陈海粟)¹, LIAO Jiachun (廖佳纯)¹, YAO Sicheng (姚思诚)¹

Author information

1. Big Data Technology Research Center, Nanhu Laboratory, Jiaxing 314002, Zhejiang, China

Abstract

To promote the protection of personal information during data opening, this paper analyzes in depth the current state of personal information disclosure in open government data. First, datasets are obtained from relevant platforms and preprocessed, and the datasets that contain personal information are selected based on features such as field names and table names. Next, sensitive information identification methods are applied to identify the various types of personal information in the data and to map them to individuals, so that the number of individuals can be counted and their associated data detected. Finally, data visualization is used to present the state of personal information disclosure intuitively. Although some public data open platforms have applied measures such as data grading and classification and de-identification, the published data still contain a large amount of directly displayed personal information. Improvements are needed in standardized data grading and classification, sensitive information identification, and desensitization of sensitive information.
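The three analysis stages named in the abstract (screening by field names, identifying personal information, mapping it to individuals for counting) can be illustrated with a minimal sketch. The abstract does not say which identification methods the authors use, so everything here is an assumption made for illustration: the keyword list `PERSONAL_FIELD_HINTS`, the simplified regular expressions in `PATTERNS`, the choice of a name column as the grouping key, and the made-up sample rows.

```python
import re
from collections import defaultdict

# Hypothetical keyword list for screening column names; not taken from the paper.
PERSONAL_FIELD_HINTS = ("姓名", "身份证", "手机", "电话", "name", "id_card", "phone")

# Simplified illustrative patterns (assumptions): an 18-character resident ID
# number and an 11-digit mainland mobile number, each not embedded in a longer
# digit run. No checksum validation is performed.
PATTERNS = {
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
}


def has_personal_fields(columns):
    """Stage 1: flag a table whose column names hint at personal information."""
    return any(hint in col.lower() for col in columns for hint in PERSONAL_FIELD_HINTS)


def identify(value):
    """Stage 2: return (type, matched_text) pairs found in one cell value."""
    text = str(value)
    return [(kind, m) for kind, pat in PATTERNS.items() for m in pat.findall(text)]


def group_by_individual(rows, key_field):
    """Stage 3: collect the detected items per individual, keyed by key_field."""
    people = defaultdict(set)
    for row in rows:
        key = row.get(key_field)
        if key is None:
            continue  # the row cannot be attributed to anyone
        for value in row.values():
            people[key].update(identify(value))
    return people


# Made-up rows for demonstration only; the numbers are synthetic.
rows = [
    {"姓名": "张三", "联系电话": "13800000000", "备注": "证件号 110101199001010011"},
    {"姓名": "张三", "联系电话": "13800000000"},
    {"姓名": "李四", "联系电话": "无"},
]

if has_personal_fields(rows[0].keys()):
    people = group_by_individual(rows, "姓名")
    for person, items in people.items():
        print(person, len(items), sorted(items))
```

Grouping by name alone would merge different people who share a name, so a real analysis would likely need a stronger key, such as a name combined with an ID number. Substring matching on column names is also coarse (the hint `name` matches `filename`, for example), which is one reason a screening step like this would normally be followed by content-level identification.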

Key words

big data privacy/personal information/open government data/information identification/statistical analysis

Funding

Nanhu Laboratory Small and Micro Project (NSS2023C2002)

Publication year

2024

Journal of Shandong University (Natural Science)

Shandong University

Indexed in: CSTPCD, CSCD, Peking University Core Journals

Impact factor: 0.437

ISSN: 1671-9352

Number of references: 17
