Automatic identification technology for unstructured document content in power grid OA systems
Power grid OA systems hold a large number of unstructured documents that are difficult to identify. To address this problem, an automatic identification technology for unstructured document content in power grid OA systems is studied. An indirect conversion method converts the unstructured data into semi-structured data carried by an XML file, and the semi-structured data is then parsed with a SAX parser. The text information is deduplicated with the Simhash algorithm. The TextRank algorithm is used to extract keywords from the text, and the unstructured document content of the power grid OA system is identified according to these keywords. Test results show that, with the Hamming distance threshold set to 10 and the similarity threshold set to 70, a good deduplication effect is obtained and keyword extraction performs well, indicating that the approach is suitable for wider adoption.
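The deduplication step can be illustrated with a minimal Simhash sketch. This is not the implementation used in the study: the whitespace-and-punctuation tokenizer, the MD5-based 64-bit token hash, the uniform token weights, and the function names (`tokenize`, `simhash`, `hamming_distance`, `is_near_duplicate`) are all assumptions made for illustration. Only the Hamming distance threshold of 10 comes from the text above.

```python
import hashlib
import re


def tokenize(text):
    # Assumption: simple lowercase word tokens; the study's tokenizer is not specified.
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    # Accumulate a signed vote per bit position over all token hashes.
    weights = [0] * bits
    for token in tokenize(text):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the MD5 digest
        for i in range(bits):
            if (h >> i) & 1:
                weights[i] += 1  # assumption: every token has weight 1
            else:
                weights[i] -= 1
    # A bit of the fingerprint is 1 where the votes are positive.
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=10):
    # Threshold of 10 taken from the reported test condition.
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Under these assumptions, two documents would be treated as duplicates when their 64-bit fingerprints differ in at most 10 bit positions. A production system would typically weight tokens (for example by term frequency) and index fingerprints rather than compare every pair.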