
Pathology reports contain essential data for both clinical and research purposes. However, extracting meaningful, qualitative data from the original document is difficult because of the narrative and complex nature of such reports. Keyword extraction for pathology reports is necessary to summarize the informative text and reduce the time required to process it. In this study, we employed a deep learning model for natural language processing to extract keywords from pathology reports and presented a supervised keyword extraction algorithm. We considered three types of pathological keywords, namely specimen, procedure, and pathology types. We compared the performance of the present algorithm with conventional keyword extraction methods on 3115 pathology reports that were manually labeled by professional pathologists. Additionally, we applied the present algorithm to 36,014 unlabeled pathology reports and analysed the extracted keywords with biomedical vocabulary sets. The results demonstrated the suitability of our model for practical application in extracting important data from pathology reports.

The pathology report is the fundamental evidence for the diagnosis of a patient. All kinds of specimens from all operations and biopsy procedures are examined and described in the pathology report by the pathologist. As a document that contains detailed pathological information, the pathology report is required in all clinical departments of the hospital. However, the extraction and generation of research data from the original document are extremely challenging, mainly because of the narrative nature of the pathology report. As such, the data management of pathology reports tends to be excessively time-consuming and requires tremendous effort and cost.

Several conventional keyword extraction algorithms have been based on features of a text such as term frequency-inverse document frequency and word offset [1, 2]. This approach is straightforward but is not suitable for analysing the complex structure of a text or for achieving high extraction performance. Rule-based algorithms have been selectively adopted for automated data extraction from highly structured text data [3]. However, this kind of approach is difficult to apply to complex data such as those in the pathology report and is rarely used in hospitals.

Advances in machine learning (ML) algorithms bring a new vision for more accurate and concise processing of complex data. ML algorithms can be applied to text, images, audio, and other types of data. The most widely used ML approach is the support-vector machine, followed by naïve Bayes, conditional random fields, and random forests [4]. Deep learning approaches are increasingly adopted in medical research. For example, long short-term memory (LSTM) and convolutional neural networks (CNN) have been applied to named entity recognition in the biomedical context [5, 6].

There have been many studies on word embeddings, which allow natural language to be handled through numeric computation. In conventional word embedding, a word is represented by a numeric vector designed to capture relative word meaning, an approach known as word2vec [7]. In addition, word tokenizing techniques are used to handle words that are rarely observed in the corpus [8].
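
As a rough illustration of the conventional term frequency-inverse document frequency approach mentioned earlier, the following minimal sketch ranks the words of each report by their TF-IDF score. It is not the method evaluated in this study. The function name `tfidf_keywords`, the whitespace tokenization, and the three toy reports are assumptions made only for this example.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which real reports would not tolerate.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Score = relative term frequency * log inverse document frequency.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([term for term, _ in ranked[:top_k]])
    return results


# Invented toy reports, not data from the study.
reports = [
    "stomach biopsy adenocarcinoma",
    "colon biopsy adenoma",
    "stomach resection adenocarcinoma",
]
keywords = tfidf_keywords(reports, top_k=1)
```

Because the score depends only on word counts, a term that occurs in every report receives a score of zero regardless of its clinical importance, which is one reason such methods struggle with narrative text.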
