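As open source software has grown and spread at scale, it has also given rise to an ecosystem for open source development and collaboration, in which individuals and organizations jointly develop high-quality software that anyone can use. Social collaboration platforms, represented by GitHub, have further promoted large-scale, distributed, fine-grained code collaboration and technical socialization. Every day, countless developers submit code, review code, report bugs, or request new features on these platforms, and how to mine valuable information from this massive volume of collaborative behavior data is a current research challenge. To address this, we designed and implemented OpenDigger, a one-stop data mining system for the open source collaboration digital ecosystem, with the goal of building data infrastructure for the open source field and promoting the sustained development of the open source ecosystem. OpenDigger consists mainly of a data collection service, a data storage module, a label data module, and an information service module. Built on an OLAP columnar database and a graph database, it continuously collects open source ecosystem data from multiple sources and provides a variety of open source information services to different user groups through a unified interface. OpenDigger mines key information in the open source digital ecosystem from the perspective of collaboration relationship networks. Compared with traditional statistical indicators, the collaboration network perspective better reveals the associations between open source projects and developers. Users can model and analyze open source ecosystem data using an online analysis environment or CLI tools. OpenDigger serves multiple companies and communities, including Ant Financial, Alibaba, and the Mulan Open Source Community, providing open source digital insight capabilities for OSPO (Open Source Program Office) practitioners and open source project operation leads.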
Data Mining and Information Service for Open Collaboration Digital Ecosystem
The large-scale development and proliferation of open source software has created an ecosystem for open source development and collaboration. Within this ecosystem, individuals and organizations collaboratively develop high-quality software that is accessible to all. Social collaboration platforms, represented by GitHub, have further facilitated large-scale, distributed, and fine-grained code collaboration and technical socialization. Countless developers submit code, review code, report bugs, or propose new feature requests on these platforms every day. The result is a vast amount of behavioral data from a fully open collaborative development process, and mining valuable information from it remains a research challenge. This paper designs and implements OpenDigger, a one-stop data mining system for the open source collaboration digital ecosystem. Its goal is to build data infrastructure for the open source field and promote the continuous development of the open source ecosystem. The OpenDigger system consists primarily of a data collection module, a storage module, a tag data module, and an information service module. It is built upon an OLAP columnar database and a graph database. The system continuously collects data from multiple sources within the open source ecosystem and provides various types of open source information services to different user groups through a unified interface. Additionally, OpenDigger mines key information from the open source digital ecosystem from the perspective of collaborative relationship networks. Compared to traditional statistical indicators, the collaborative network perspective better illustrates the association characteristics between open source projects and developers.
Open source ecosystem; Open collaboration; Data mining; Information system; Graph analysis
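The contrast drawn in the abstract between traditional statistical indicators and the collaboration network perspective can be illustrated with a minimal sketch. It is not OpenDigger's implementation: the event records, the `(developer, repository, event_type)` schema, and all function names below are hypothetical, and the network is simply a developer-to-developer graph weighted by the number of shared repositories.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical event records: (developer, repository, event_type).
# The schema and values are illustrative assumptions, not OpenDigger data.
EVENTS = [
    ("alice", "org/repo-a", "commit"),
    ("alice", "org/repo-a", "review"),
    ("bob", "org/repo-a", "issue"),
    ("bob", "org/repo-b", "commit"),
    ("carol", "org/repo-b", "review"),
    ("carol", "org/repo-b", "commit"),
]


def activity_counts(events):
    """Statistical indicator: number of events per developer."""
    counts = defaultdict(int)
    for dev, _repo, _kind in events:
        counts[dev] += 1
    return dict(counts)


def collaboration_network(events):
    """Developer-developer edges weighted by the number of shared repositories."""
    devs_by_repo = defaultdict(set)
    for dev, repo, _kind in events:
        devs_by_repo[repo].add(dev)
    edges = defaultdict(int)
    for devs in devs_by_repo.values():
        # Every pair of developers active in the same repository is linked.
        for a, b in combinations(sorted(devs), 2):
            edges[(a, b)] += 1
    return dict(edges)


def weighted_degree(edges):
    """Sum of edge weights per developer in the collaboration network."""
    degree = defaultdict(int)
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


if __name__ == "__main__":
    print(activity_counts(EVENTS))
    print(weighted_degree(collaboration_network(EVENTS)))
```

In this toy data every developer has the same event count, yet the network view shows that `bob` links the two repositories while `alice` and `carol` each have a single collaborator, which is the kind of association a plain count cannot express.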