[712] Russian Corpus of Biographical Texts Data Set

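Disclaimer: This data was compiled by the 极风数据 team and is intended for academic research only. Please do not copy it maliciously or use it for other purposes.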

0. Dataset ID: 712
1. Dataset name: Russian Corpus of Biographical Texts Data Set
2. Source: University of Tyumen, Russia
3. Time span: up to 2020-06-03
4. Geographic scope:
5. Data size: 14KB
6. Data format: csv
7. Description:
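The corpus contains Wikipedia texts split into individual sentences, each carrying a topic label. It was created for the task of automatically finding fragments that contain biographical information in natural-language text. The corpus includes 200 Russian biographical articles (Wikipedia, 2018).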

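Text pre-processing and selection included the following steps:

- first, the initial collection of texts was carried out automatically using open Python libraries;
- we deleted short texts containing only the years of a person's life and a list of his places of work;
- we deleted all sections except the "Biography" section, because biographical articles on Wikipedia contain lists of awards, scientific works, other works, and further sections that are inconvenient to mark up.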

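The corpus includes biographies of individuals whose main activity is related to one of the following areas:

- military and law enforcement officers;
- figures of culture and art;
- figures of science, technology and education;
- politicians and public figures;
- entrepreneurs and managers;
- religious figures.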

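Attribute Information:

The corpus is a text collection divided into sentences. Each sentence refers to one or two thematic classes: non-biographical fact (none); personal events (personal_events); professional events (professional_events); birth, death, nationality, information about the parental family (parenting); affiliation, education, family, place of residence (residence); occupation, position (occupation); other biographical facts (other).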

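The corpus of biographical texts consists of the following elements:

- texts presented in .xml format (each sentence includes the attributes "text" and "type" (thematic class) and, if available, "additional_type" (additional thematic class));
- a file in .csv format describing the corpus, which contains information about the texts (name of the person, years of life, area of main activity).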
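As an illustration of the .xml layout described above, the following is a minimal sketch of reading sentences and their labels with Python's standard library. Only the attribute names "text", "type" and "additional_type" come from the description; the element names `document` and `sentence`, and the sample sentences themselves, are assumptions made up for this example, so the real files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Invented sample. The element names <document> and <sentence> are assumptions;
# only the attribute names "text", "type", "additional_type" are documented.
SAMPLE_XML = """<document>
  <sentence text="He was born in 1900 in Tyumen." type="personal_events" additional_type="residence"/>
  <sentence text="The city lies on the Tura River." type="none"/>
</document>"""


def read_sentences(xml_string):
    """Return a list of (text, type, additional_type) tuples.

    additional_type is None when the attribute is absent.
    """
    root = ET.fromstring(xml_string)
    rows = []
    for node in root.iter("sentence"):
        rows.append((node.get("text"), node.get("type"), node.get("additional_type")))
    return rows


sentences = read_sentences(SAMPLE_XML)
for text, label, extra in sentences:
    print(label, extra, text)
```

If the actual files use different element names, only the string passed to `root.iter` should need to change.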

English original:
Sentence classification (Russian). The corpus contains Wikipedia texts split into sentences. Each sentence has a topic label.
The corpus was created for the task of automatic search for fragments containing biographical information in a text in a natural language. The corpus includes 200 Russian biographical articles (Wikipedia, 2018).

Text pre-processing and selection included the following steps:
– firstly, initial collection of texts was carried out automatically using open Python libraries;
– we deleted short texts containing only the years of a person's life and a list of his places of work;
– we deleted all sections except the ‘Biography’ section, because biographical articles on Wikipedia contain lists of awards, scientific works, other works, and further sections that are inconvenient to mark up.
The corpus includes biographies of individuals whose main activity is related to one of the following areas:

– military and law enforcement officers;
– figures of culture and art;
– figures of science, technology and education;
– politicians and public figures;
– entrepreneurs and managers;
– religious figures.

Attribute Information:

The corpus is a text collection divided into sentences. Each sentence refers to one or two thematic classes: non-biographical fact (none); personal events (personal_events); professional events (professional_events); birth, death, nationality, information about the parental family (parenting); affiliation, education, family, place of residence (residence); occupation, position (occupation); other biographical facts (other).
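Because a sentence may carry one or two thematic classes, a classifier trained on this corpus would typically need a multi-label target. The sketch below shows one possible encoding as a multi-hot vector. It takes the seven identifiers given in parentheses above at face value; the function name `encode_labels` and the ordering of the classes are choices made for this example, not part of the dataset.

```python
# Class identifiers as they appear in parentheses in the attribute description.
# The ordering is arbitrary and chosen for this example.
CLASSES = [
    "none",
    "personal_events",
    "professional_events",
    "parenting",
    "residence",
    "occupation",
    "other",
]


def encode_labels(main_type, additional_type=None):
    """Return a multi-hot list over CLASSES for one sentence."""
    vector = [0] * len(CLASSES)
    for label in (main_type, additional_type):
        if label is None:
            continue
        # list.index raises ValueError if the label is not a known class.
        vector[CLASSES.index(label)] = 1
    return vector


print(encode_labels("professional_events", "occupation"))
# → [0, 0, 1, 0, 0, 1, 0]
```

A sentence with only a main class yields a vector with a single 1, so the same representation covers both cases.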

The corpus of biographical texts consists of the following elements:

– texts presented in .xml format (each sentence includes the attributes ‘text’ and ‘type’ (thematic class) and, if available, ‘additional_type’ (additional thematic class));
– a file with a description of the corpus in .csv format, which contains information about the texts (name of the person, years of life, area of main activity).
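The .csv description file could be used, for instance, to check how the biographies are distributed across areas of activity. The sketch below assumes a header row with the hypothetical column names `name`, `years_of_life` and `area`; the real column names are not given in the description, and the sample rows are invented placeholders.

```python
import csv
import io
from collections import Counter

# Invented sample. The column names are hypothetical placeholders for
# "name of the person", "years of life" and "area of main activity".
SAMPLE_CSV = """name,years_of_life,area
Person A,1900-1980,science
Person B,1850-1920,military
Person C,1910-1990,science
"""


def count_by_area(csv_file):
    """Count rows per value of the (assumed) 'area' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["area"] for row in reader)


area_counts = count_by_area(io.StringIO(SAMPLE_CSV))
print(area_counts.most_common())
```

For a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file object, for example `open(path, newline="", encoding="utf-8")`, with the encoding adjusted to match the file.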

References:
[1] Glazkova A.V. Automatic search for fragments containing biographical information in a natural language text. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS).