MOTIVATION:
The task of eXtreme Multi-Label Classification (XMLC) deals with the problem of assigning multiple labels to a data object. The specific challenge is that each data object receives only a few labels, k, out of a pool of n labels numbering in the hundreds or thousands, i.e., k << n. XMLC is of tremendous practical importance: industries and organizations in fields such as retailing, web-content recommendation, scientific libraries, news providers and others are dealing with it.
OBJECTIVE:
After Assignment 1 focused on the theoretical aspects of text classification, the objective of this assignment is to get practical experience in designing, implementing, running and scientifically evaluating your own XML classifier. The dataset will be a collection of two very large-scale resources of scientific papers in economics and medicine. Each 'document' in this dataset represents a record from a large document corpus. You can also use further datasets of your choice.
Task Description: eXtreme Multi-Label Classification of Scientific Papers
Multi-label classification is one of the standard tasks in text analytics. The objective of the assignment is to perform an extreme multi-label classification, XMLC for short. In an XMLC setting, k labels from a large pool of n labels are to be assigned to each data object. The classification task is extreme in two senses: First, the number of labels n is very large, with hundreds or thousands of labels. Second, only very few labels k are assigned, i.e., k << n. Thus, false positives are likely.
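To make the relation k << n concrete, the following minimal sketch computes the size of the label pool, the mean number of labels per document, and their ratio for a list of label sets. The function name label_statistics and the toy label identifiers are hypothetical, and n here counts only the labels that actually occur in the sample, not the size of the full thesaurus.

```python
def label_statistics(label_sets):
    """Return (n, mean k, density) for a list of per-document label sets.

    n       -- number of distinct labels observed in the sample
    mean k  -- average number of labels per document
    density -- mean k divided by n (small values indicate k << n)
    """
    if not label_sets:
        return 0, 0.0, 0.0
    pool = set().union(*label_sets)
    n = len(pool)
    mean_k = sum(len(labels) for labels in label_sets) / len(label_sets)
    density = mean_k / n if n else 0.0
    return n, mean_k, density


# Toy example with hypothetical label identifiers.
toy = [{"a", "b"}, {"b"}, {"c", "d", "e"}]
n, mean_k, density = label_statistics(toy)
print(n, mean_k, density)
```

On the real datasets the density should be far smaller than in this toy example, since the label pools contain thousands of labels.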
Data description
We compiled two English datasets from two digital libraries, EconBiz and PubMed.
EconBiz
The EconBiz dataset was compiled from a meta-data export provided by ZBW – Leibniz Information Centre for Economics from July 2017. We only retained those publications that were flagged as being in English and that were annotated with STW labels. Afterward, we removed duplicates by checking for the same title and labels. In total, approximately 1,064k publications remain. The annotations were selected by human annotators from the Standard Thesaurus Wirtschaft (STW), which contains approximately 6,000 labels.
PubMed
The PubMed dataset was compiled from the training set of the 5th BioASQ challenge on large-scale semantic subject indexing of biomedical articles, which were all in English. Again, we removed duplicates by checking for the same title and labels. In total, approximately 12.8 million publications remain. The labels are so-called MeSH terms. In our data, approximately 28k of them are used.
Fields
Both datasets share the same set of fields:
- id: An identifier used to refer to the publication in the respective digital library.
- title: The title of the publication.
- labels: A string that represents a list of labels, separated by TAB.
- fold: The number of the fold a sample belongs to in our study, given for reproducibility of our results. Folds 0 to 9 contain the samples that have a full text; fold 10 contains all other samples.
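The fields above can be handled with a few lines of Python. This is a minimal sketch that assumes each record has already been read into a dictionary keyed by the four field names; the on-disk file format is not specified here, so loading is left out. The helper names parse_labels and split_by_fold, the include_fold_10 switch, and the toy records are hypothetical. Whether the fold-10 samples (no full text) should be added to the training data is a design decision for you to make and justify.

```python
def parse_labels(label_field):
    """Split the TAB-separated `labels` string into a list of labels."""
    return [label for label in label_field.split("\t") if label]


def split_by_fold(records, test_fold, include_fold_10=False):
    """Hold out one full-text fold (0-9) for testing.

    All other full-text folds go to the training set. Fold 10 is
    only added to the training set if include_fold_10 is True
    (an assumption of this sketch, not a requirement).
    """
    if not 0 <= test_fold <= 9:
        raise ValueError("test_fold must be between 0 and 9")
    train, test = [], []
    for record in records:
        fold = int(record["fold"])
        if fold == test_fold:
            test.append(record)
        elif fold < 10 or include_fold_10:
            train.append(record)
    return train, test


# Toy records with hypothetical identifiers and labels.
records = [
    {"id": "r1", "title": "Monetary policy and inflation",
     "labels": "L1\tL2", "fold": 0},
    {"id": "r2", "title": "Labour markets in Europe",
     "labels": "L3", "fold": 1},
    {"id": "r3", "title": "Trade and growth",
     "labels": "L1\tL4", "fold": 10},
]
train, test = split_by_fold(records, test_fold=0)
print([r["id"] for r in train], [r["id"] for r in test])
print(parse_labels(records[0]["labels"]))
```

Repeating the split for each test fold from 0 to 9 gives a 10-fold cross-validation over the full-text samples.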
Tasks
Your tasks will be as follows:
- In Task 1, building on the basic review of text classification approaches in Assignment 1, you will add a review of the literature on the state of the art in multi-label classification. Particular attention shall be given to XMLC approaches. We expect this to be between two and four pages in ACM style plus references, which do not count towards the page limit.
- In Task 2, you will devise and train your own XML classifier using a suitable set of features extracted from the given dataset. You can use any tools you like for extracting the features, e.g., any of the tools you have come across in the labs such as NLTK, scikit-learn, etc. You can also use the implementation given with the paper by Mai et al., 2018. But please note that simply rerunning the code is not sufficient. You need to design your own new classifier, implement and run it, and evaluate its performance on XMLC datasets.
- In Task 3, you will write a scientific report explaining what you did in Task 2 together with some motivation (why you did it) and reflection (which alternatives you considered, lessons learned, etc.). You will compare and contrast your results with what you found in Task 1, i.e., discuss and reflect on the results in a scientifically sound manner. We expect this to be between eight and ten pages in ACM style plus references, which do not count towards the page limit. Please note: You can integrate the report of Task 1 into Task 3. In this case, indicate this clearly in the report of Task 3.
TASK 1: XMLC literature review (report of 3-4 pages in ACM style plus references)
Whenever you develop a new classifier or other text analytics software, you have to show that your system outperforms the state of the art. The goal of this part of the assignment is to explore the landscape of XMLC approaches. This review should highlight what multi-label classification algorithms are out there and which approaches and features have led to the best classification accuracy.
In Assignment 1, you have already identified simple approaches that can easily be implemented and which serve as simple baseline systems. An example of a baseline system could be one that simply uses tokens as features. We would expect that your own XML classifier will outperform a simple baseline system, and it may also match or even exceed the results of state-of-the-art approaches. You can start by looking at the papers referenced in this assignment.
TASK 2: Devising and training your own classifier
This task will only be marked if you have completed Task 1.
For this task, you will develop your own XML classifiers using the features of your choice and the classification algorithms of your choice. The preprocessing steps and classification tools you have come across in the labs should get you started. Doing this will involve:
- Identifying the approaches and features you want to extract;
- Developing code to extract these features from text (and weight them if you want to use more than simply binary features);
- Training a classifier that uses these features;
- Evaluating the performance of the classifier using a scientifically sound methodology;
- Starting with a subset of the large datasets and subsequently scaling the classifier to include more and more data until you are using the entire datasets or very large parts of them. You can request access to the HPC CERES with your student account, see: https://hpc.essex.ac.uk/
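As a starting point for the feature-weighting and evaluation steps listed above, here is a minimal sketch in plain Python. It weights title tokens by a smoothed TF-IDF variant and scores predictions with the sample-averaged F1 measure. Both are illustrative choices made for this sketch, not requirements of the assignment; all function names are hypothetical, and in practice you would likely rely on a library such as scikit-learn and plug in your own classifier between the two steps.

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def fit_idf(tokenized_docs):
    """Compute smoothed inverse document frequencies on training data."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for toks in tokenized_docs for tok in set(toks))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Return an L2-normalised sparse TF-IDF vector as a dict.

    Tokens that were not seen when fitting the IDF are ignored.
    """
    counts = Counter(tok for tok in tokens if tok in idf)
    weights = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}


def sample_f1(true_sets, pred_sets):
    """Average the per-document F1 between true and predicted label sets.

    A document with no true and no predicted labels is scored 1.0
    (a convention chosen for this sketch).
    """
    if not true_sets:
        raise ValueError("need at least one document")
    scores = []
    for truth, pred in zip(true_sets, pred_sets):
        denom = len(truth) + len(pred)
        scores.append(2 * len(truth & pred) / denom if denom else 1.0)
    return sum(scores) / len(scores)


# Toy usage with hypothetical titles and labels.
docs = [tokenize(t) for t in ["Monetary policy", "Fiscal policy"]]
idf = fit_idf(docs)
vec = tfidf_vector(docs[0], idf)
print(vec)
print(sample_f1([{"L1", "L2"}], [{"L1"}]))
```

In the toy example, "policy" occurs in both titles and therefore receives a lower weight than "monetary". Fitting the IDF values on the training folds only, and applying them unchanged to the test fold, keeps the evaluation free of information leakage.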
TASK 3: Report (8-10 pages in ACM style plus references)
This task will only be marked if you have completed Tasks 1 and 2.
Finally, you will write a report documenting what you did in Task 2 and comparing and contrasting your approach with what you discussed in Task 1. You should explain why you decided on the algorithms and features you used and how this compares to the state of the art. You should discuss the performance of your approach and reflect on what you have learned