Datasets

The NLP lab provides the following datasets for research purposes:

  • Mona: Persian Named Entity Tagged Dataset

The dataset contains 3,000 Persian Wikipedia abstracts (about 100k tokens) annotated with 15 entity types in IOB format.
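A file in IOB format can be turned into entity spans with a few lines of code. The following is a minimal sketch: the sample sentence, the `LOC` label, and the token/tag pairing are illustrative assumptions, not taken from Mona itself (which uses 15 entity types over Persian text).

```python
def iob_to_spans(tagged_tokens):
    """Collect (entity_type, entity_text) spans from a token/IOB-tag sequence."""
    spans = []
    current = None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):          # a new entity begins
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continue the open entity
        else:                             # "O" tag or an inconsistent I- tag
            current = None
    return [(etype, " ".join(tokens)) for etype, tokens in spans]

# Hypothetical English sample for readability; Mona's data is Persian.
sample = [
    ("Tehran", "B-LOC"),
    ("is", "O"),
    ("the", "O"),
    ("capital", "O"),
    ("of", "O"),
    ("Iran", "B-LOC"),
]
print(iob_to_spans(sample))  # → [('LOC', 'Tehran'), ('LOC', 'Iran')]
```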


  • Persian Word Embeddings

The dataset contains four vector representations for Persian words, each trained on a different corpus (Hamshahri, IR-blog, Wikipedia, and Twitter), as well as a comprehensive model trained on all four corpora.

The related paper for this dataset is:

Amir Hadifar and Saeedeh Momtazi. The Impact of Corpus Domain on Word Representation: A Study on Persian Word Embeddings. Language Resources and Evaluation, 52(4):997–1019, 2018.
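Pre-trained vectors like these are commonly distributed in the plain-text word2vec format (a header line "vocab_size dim", then one "word v1 v2 ..." line per word). The sketch below parses that format and compares two words by cosine similarity; whether these models use exactly this format is an assumption, and the tiny inline "file" with 3-dimensional vectors is invented for illustration.

```python
import math

# Hypothetical miniature word2vec text file; real models are far larger.
SAMPLE = """3 3
book 0.2 0.1 0.9
library 0.3 0.2 0.8
car 0.9 0.1 0.1
"""

def load_vectors(text):
    """Parse word2vec text format into a {word: [floats]} dict."""
    lines = text.strip().splitlines()
    vectors = {}
    for line in lines[1:]:                # skip the "vocab_size dim" header
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    return vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vecs = load_vectors(SAMPLE)
# Semantically related words should score higher than unrelated ones.
print(cosine(vecs["book"], vecs["library"]) > cosine(vecs["book"], vecs["car"]))  # → True
```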


  • Porseshgan: Persian Questions for Automatic QA over Knowledge Graph


  • Dastaar: Persian Newspapers Categorization Dataset


  • Didgah: Persian Aspect-based Sentiment Analysis


  • Annotated Questions for Multi-label Question Classification in Community Question Answering