Conference Publication Details
Mandatory Fields
Barros, JM; Buitelaar, P; Duggan, J; Rebholz-Schuhmann, D
Unsupervised Classification of Health Content on Reddit
PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON DIGITAL PUBLIC HEALTH (DPH '19)
2019
January
Published
1
Optional Fields
health informatics; unsupervised learning; word embeddings; discussion forum; clustering; vocabulary building; ONLINE INFORMATION FORUMS PEOPLE TRUST
85
89
Online forums are easily accessible to the public and useful for acquiring and disseminating health information; however, advanced methods have to be applied to interpret the content correctly. For this reason, we propose the application of an unsupervised embedding-based approach for health content classification. Specifically, we utilise word embeddings and a clustering method to create content-sensitive word clusters; we then align the health content with the clusters, classifying it into illnesses/medication/disease agents. The results suggest that a cosine similarity of 0.70 is preferred for the creation of informative clusters as well as for the automatic generation of synonyms, acronyms, abbreviations and common misspellings. Our approach demonstrates the potential of discussion forums, in particular Reddit, not only for unsupervised content classification but also for dictionary building from informal health content.
10.1145/3357729.3357745
Grant Details
Publication Themes