The NLP lab provides the following datasets for research purposes:
- Mona: Persian Named Entity Tagged Dataset
The dataset contains 3,000 Persian Wikipedia abstracts (about 100k tokens) annotated with 15 entity types in IOB format.
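Since the annotations are in IOB format, a small loader may clarify how such a file is typically consumed. This is only a sketch, assuming the common two-column layout (one token and tag per line, separated by a tab, with blank lines between sentences); the actual file layout of Mona may differ.

```python
def read_iob(path):
    """Parse a two-column IOB file: one 'token<TAB>tag' pair per line,
    sentences separated by blank lines.

    Returns a list of (tokens, tags) pairs, one per sentence.
    Assumes the common CoNLL-style layout; Mona's exact format may differ.
    """
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # Blank line ends the current sentence.
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")
            tokens.append(token)
            tags.append(tag)
    # Flush a trailing sentence with no final blank line.
    if tokens:
        sentences.append((tokens, tags))
    return sentences
```

Each `(tokens, tags)` pair keeps the token sequence aligned with its `B-`/`I-`/`O` labels, which is the form most NER toolkits expect.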
- Persian Word Embeddings
The dataset contains four vector representations for Persian words, each trained on a different corpus (Hamshahri, IR-blog, Wikipedia, and Twitter), as well as a comprehensive model trained on all four corpora.
The related paper for this dataset is:
Amir Hadifar and Saeedeh Momtazi. The Impact of Corpus Domain on Word Representation: A Study on Persian Word Embeddings. Language Resources and Evaluation, 52(4):997–1019, 2018.
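To illustrate how pre-trained vectors like these are commonly used, here is a minimal sketch that loads vectors and compares two words by cosine similarity. It assumes the vectors are distributed in the plain word2vec text format (a `vocab_size dim` header line, then one `word v1 v2 ...` line per word); the lab's actual distribution format is not stated, so treat this only as an illustration.

```python
import numpy as np

def load_vectors(path):
    """Load word vectors from word2vec text format.

    Assumes the first line is a 'vocab_size dim' header and each
    following line is 'word v1 v2 ...' separated by single spaces.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        f.readline()  # skip the 'vocab_size dim' header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With vectors loaded, `cosine(vectors["w1"], vectors["w2"])` gives a similarity in [-1, 1]; words that behave similarly in the training corpus score close to 1.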
- Porseshgan: Persian Questions for Automatic QA over Knowledge Graph
- Dastaar: Persian Newspapers Categorization Dataset
- Didgah: Persian Aspect-based Sentiment Analysis
- Annotated Questions for Multi-label Question Classification in Community Question Answering