• The NLP lab provides the following datasets for research purposes:

  • Mona: Persian Named Entity Tagged Dataset

The dataset contains 3,000 Persian Wikipedia abstracts (about 100k tokens), annotated with 15 entity types in IOB format.
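In IOB (Inside-Outside-Beginning) tagging, the first token of an entity is labeled B-TYPE, subsequent tokens of the same entity I-TYPE, and all other tokens O. A minimal sketch of grouping IOB-tagged tokens into entity spans (the tag names and example tokens below are illustrative, not drawn from Mona itself):

```python
def iob_to_spans(tokens, tags):
    """Group (token, IOB tag) pairs into (entity_type, token_list) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always opens a new entity, closing any open one first.
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # I- continues the currently open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, current_tokens))
    return spans
```

For example, `iob_to_spans(["Tehran", "is", "in", "Iran"], ["B-LOC", "O", "O", "B-LOC"])` yields two single-token LOC spans.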


The word-embedding dataset contains four different vector representations for Persian words, each trained on a different corpus (Hamshahri, IR-blog, Wikipedia, and Twitter), as well as a comprehensive model trained on all four corpora.

The related paper for this dataset is:

Amir Hadifar and Saeedeh Momtazi. The Impact of Corpus Domain on Word Representation: a Study on Persian Word Embeddings. Journal of Language Resources and Evaluation, 52(4):997–1019, 2018.
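The distribution format of the embedding files is not stated here; assuming they follow the common word2vec text format (a header line "vocab_size dim", then one "word v1 v2 ..." line per word — an assumption, not confirmed by this page), a minimal self-contained loader and cosine-similarity helper could look like:

```python
import io
import math

def load_word2vec_text(fileobj):
    """Parse word2vec text format into a {word: vector} dict."""
    header = fileobj.readline().split()
    dim = int(header[1])  # header is "vocab_size dim"
    vectors = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:1 + dim]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Tiny illustrative file (not real data from the release).
sample = io.StringIO("2 3\ncat 1 0 0\ndog 0 1 0\n")
vecs = load_word2vec_text(sample)
```

In practice a library such as gensim is the usual choice for this format; the sketch above only shows what the parsing involves.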


  • Porseshgan: Persian Questions for Automatic QA over Knowledge Graph


  • Dastaar: Persian Newspapers Categorization Dataset


  • Didgah: Persian Aspect-based Sentiment Analysis


  • Annotated Questions for Multi-label Question Classification in Community Question Answering