Schizophrenia is a mental disorder that affects a person’s thinking, speech, and actions. It can reduce a person’s ability to process auditory information and make decisions. Analyzing this disorder correctly is important because it may point to ways of reducing its negative effects on patients. Linguists and psychiatrists have been investigating language impairments and speech disorders in people with schizophrenia, which remains a challenging problem. In this study, we address this issue by analyzing linguistic features, specifically cohesion, in the writing and speech transcripts of patients with schizophrenia. Our results show that combining referential cohesion with text easability or situation model features provides the best performance on the speech dataset, whereas on the writing dataset, readability alone or a combination of situation model and readability features yields the best performance.

Schizophrenia is a psychotic disorder whose main symptom is an impaired perception of reality [1]. It impairs the normal functioning of the brain so that the way an individual thinks, expresses themselves, or relates to others becomes distorted [2]. Furthermore, it can significantly impair functional abilities such as learning and social interaction [3].

Currently, more than 21 million people worldwide suffer from schizophrenia [4], and there is a need for a deeper understanding of the condition. Such understanding could be critical not only for assessing patients but also for identifying them so that they can receive appropriate medical care in a timely manner.

Language can play a crucial role in identifying mental illness [5]. Previous studies have shown that language can help in diagnosing and predicting mental illness, e.g., in identifying people who suffer from depression and anxiety [6]–[10], Alzheimer’s [11]–[13], post-traumatic stress disorder (PTSD) [14], or schizophrenia [15]–[17]. In schizophrenia specifically, there can be impaired coherence and an overall lack of contextual structure [18]. Hence, in this work, we investigate linguistic features related to cohesion in two datasets, (1) recorded and transcribed speech and (2) written essays, with the end goal of identifying and classifying patients with schizophrenia. For this purpose, we trained two machine learning models, namely a Support Vector Machine (SVM) and a Random Forest (RF), to distinguish patients from controls. Our results show that, among all cohesion features, situation model and readability features performed best on the writing dataset, and a combination of referential cohesion, text easability, and situation model features performed best on the speech dataset.
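The comparison of feature-group combinations described above can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the column layout in `FEATURE_GROUPS`, the synthetic data, and the nearest-centroid classifier with leave-one-out evaluation, which stands in for the SVM and RF models and is not the method used in this study.

```python
import itertools
import random
import statistics

# Hypothetical mapping from cohesion feature groups to column indices.
# The real feature layout is not specified in the text.
FEATURE_GROUPS = {
    "referential_cohesion": [0, 1],
    "text_easability": [2, 3],
    "situation_model": [4, 5],
    "readability": [6],
}


def group_combinations(groups, max_size=3):
    """Yield every non-empty combination of group names up to max_size."""
    names = sorted(groups)
    for size in range(1, max_size + 1):
        yield from itertools.combinations(names, size)


def select_columns(row, groups, combo):
    """Keep only the columns that belong to the groups named in combo."""
    return [row[i] for name in combo for i in groups[name]]


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    This simple classifier is a stand-in for SVM/RF so that the sketch
    needs no third-party packages.
    """
    correct = 0
    for held in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [x for i, x in enumerate(X) if y[i] == label and i != held]
            centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
        predicted = min(
            centroids,
            key=lambda lab: sum(
                (a - b) ** 2 for a, b in zip(X[held], centroids[lab])
            ),
        )
        correct += predicted == y[held]
    return correct / len(X)


def rank_combinations(X, y, groups, max_size=3):
    """Score each feature-group combination and sort from best to worst."""
    scored = []
    for combo in group_combinations(groups, max_size):
        subset = [select_columns(row, groups, combo) for row in X]
        scored.append((loo_accuracy(subset, y), combo))
    return sorted(scored, reverse=True)


def make_synthetic_data(n_per_class=20, seed=0):
    """Build toy data in which only the situation_model columns carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):  # 0 = control, 1 = patient (illustrative labels)
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(7)]
            row[4] += 3 * label
            row[5] += 3 * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_data()
    for accuracy, combo in rank_combinations(X, y, FEATURE_GROUPS)[:3]:
        print(f"{accuracy:.2f}  {' + '.join(combo)}")
```

In practice, `loo_accuracy` would be replaced by cross-validated scoring of the actual SVM and RF models, and the synthetic rows by the cohesion features extracted from each transcript or essay.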


Patients with schizophrenia exhibit various cognitive symptoms, some of which involve problems with concentration and memory, which in turn may lead to disorganized speech or behavior. Diagnosing this disorder early and correctly is extremely important, as it may help alleviate its negative effects on patients. Although previous work has investigated language impairments and speech disorders in people with schizophrenia, the availability of recordings of spoken language, as well as written texts, provides an opportunity to systematically analyze the language used by patients. Among the linguistic features of cohesion investigated in this study, we found that a combination of referential cohesion, text easability, and situation model features provides the biggest boost in classification performance on the LabSpeech dataset. On the LabWriting dataset, readability and situation model features perform best with SVM, and a combination of features such as RC and Read* performs best with RF. In the future, we will explore other features of cohesion such as connectives, which create cohesive links between ideas and clauses and show how the text is organized. We also plan to collect more data from social media such as Reddit for a similar analysis. Finally, we plan to expand our analysis to other related mental health disorders.


We would like to thank the Coh-Metrix team for granting access to the tool and for providing valuable support in using it for the analyses performed in this study. We would also like to thank Michael Compton for granting access to the writing and speech datasets.

