BDA-Project-Report

Abstract

Text classification underpins many natural language processing (NLP) applications such as content moderation, spam detection, and sentiment analysis. This report presents a system for detecting illegal text and filtering high-quality comments that combines traditional machine learning (ML) methods with deep learning (DL) techniques to achieve strong performance. The illegal text detection module combines rule-based sensitive word filtering with FastText for rapid screening and BERT for final classification, striking a balance between computational cost and accuracy. Experiments show that the deep learning methods, such as BERT and FastText, clearly outperform traditional ML models such as Multinomial Naive Bayes, Random Forest, and XGBoost, achieving near-perfect accuracy and F1-scores. The high-quality comment filtering component uses a pretrained BERT model to obtain token embeddings and a trained Auto-Encoder for reconstruction-based quality assessment. Comments are evaluated by their reconstruction error, and a threshold-based rule separates high-quality comments reliably. The experimental analysis examines the two-stage hybrid procedure for illegal text detection and concludes that the Auto-Encoder properly captures the content features it was trained on.

Introduction

As digital platforms expand rapidly, supervising the quality and legality of user content has become a pressing problem. Text classification, one of the core tasks of natural language processing (NLP), is among the most important solutions to this challenge. Applications such as spam detection, sentiment analysis, and content moderation rely heavily on its precision and speed. However, the heterogeneous nature of textual data, its contextual variation, and its semantic complexity pose major difficulties for conventional machine learning (ML) techniques.

Traditional ML techniques require substantial feature engineering to process text; commonly used methods include Multinomial Naive Bayes (MNB), Support Vector Machines (SVMs), and ensemble methods such as Random Forest and XGBoost. While these models perform best on smaller datasets and comparatively simple tasks, they struggle with the deep semantic and contextual dependencies of complex text data.

In contrast, DL approaches such as FastText, Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, the Transformer architecture, and BERT represent a leap over earlier text classification methods. These models avoid manual feature construction and leverage word embeddings and attention mechanisms to learn context-sensitive representations of text. Among them, BERT has shown the best performance thanks to its bidirectional preservation of context.

In this report, we propose a comprehensive system designed for two primary tasks: illegal text detection and high-quality comment filtering. The system employs a hybrid approach that combines rule-based methods, lightweight DL models (FastText), and robust transformer-based models (BERT). Specifically:

  1. Illegal Text Detection: A two-stage process that uses FastText for fast inference and BERT for precise classification. This approach balances computational efficiency and accuracy while handling large-scale text data.

  2. High-Quality Comment Filtering: The system uses BERT embeddings and an Auto-Encoder to assess reconstruction errors and distinguish high-quality comments from low-quality ones.

The experimental evaluation compares traditional ML methods with DL techniques. Results show that DL models, particularly FastText and BERT, significantly outperform ML models in accuracy, precision, recall, and F1-score. The proposed Auto-Encoder further ensures effective filtering of high-quality comments based on reconstruction error thresholds.

This report is structured as follows:

  • Section 2 discusses related works and existing text classification techniques.
  • Section 3 details the methodology for illegal text detection and high-quality comment filtering.
  • Section 4 presents the experimental results and analysis.
  • Section 5 outlines future work and concludes the report.

By integrating advanced deep learning methods with a hybrid filtering pipeline, the proposed system demonstrates its potential for scalable, efficient, and accurate text moderation in real-world applications.

Related Works

Text classification has been a pivotal task in natural language processing (NLP) and has seen substantial progress over the years. The methodologies for text classification can be categorized into two primary paradigms: traditional Machine Learning (ML) approaches and Deep Learning (DL) techniques. Each of these paradigms has evolved to address various challenges posed by textual data, such as high dimensionality, context dependency, and variability in text length and semantics.

Machine Learning Methods

Traditional ML approaches rely heavily on feature engineering and statistical techniques. In this context, Multinomial Naive Bayes (MNB) is probably one of the earliest and simplest methods. MNB applies Bayes' theorem under the assumption that features are conditionally independent given the class label. Although this assumption rarely holds for real-world datasets, MNB's efficiency and usefulness for tasks such as spam filtering and sentiment analysis on small datasets have made it popular. Its main drawback stems from this very inability to account for feature interdependence.
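
Concretely, under this independence assumption MNB scores a class c for a document d = (w_1, ..., w_|d|) using Bayes' theorem and predicts the highest-scoring class:

```latex
P(c \mid d) \;\propto\; P(c) \prod_{i=1}^{|d|} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{|d|} P(w_i \mid c)
```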

Support Vector Machines (SVMs) are another key development in conventional ML for text classification. An SVM identifies the hyperplane that separates the data with the greatest margin, which is why it is known as a large-margin classifier. Through kernel tricks, SVMs also handle non-linear decision boundaries, so they can be applied to text data with complex patterns. SVMs work particularly well with TF-IDF feature representations, for instance in document categorization and sentiment analysis. Nevertheless, their high computational cost and extensive parameter tuning make them impractical for very large datasets.

Finally, Random Forest and boosting methods such as XGBoost and LightGBM can also perform well in text classification. Random Forest is a tree ensemble that, by selecting good splits and combining many trees, generalizes better than a single decision tree, while boosting algorithms such as XGBoost and LightGBM sequentially improve weak classifiers to reduce error. These methods have been applied primarily to structured text data, for example in document ranking and biomedical text classification. Despite their advantages, they depend heavily on feature engineering and can struggle when the data is sequential and context-sensitive.

Deep Learning Methods

In contrast to traditional ML techniques, deep learning (DL) methods have transformed text classification by learning feature representations from the text itself instead of hand-crafting them. FastText, introduced by Facebook, is a simple yet effective algorithm for text categorization. It represents words as dense vectors (word embeddings) and averages them to create document-level representations. In addition, subword modeling makes it tolerant of spelling mistakes and morphological variants, which are frequent in low-resource languages. Its simplicity, however, can be a drawback when complex relationships between words matter.
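
As a rough illustration (not the configuration used in this project), the sketch below trains and queries a supervised FastText classifier with the official fasttext Python package; the file path, labels, and hyperparameters are placeholders.

```python
import fasttext

# Train a supervised FastText classifier.
# Each line of train.txt is expected as: "__label__<class> <text>"
model = fasttext.train_supervised(
    input="train.txt",   # placeholder path to labeled training data
    lr=0.5,              # learning rate
    epoch=25,            # number of training epochs
    wordNgrams=2,        # include word bigram features
    minn=2, maxn=4,      # character n-gram range for subword information
)

# Predict the most likely label and its probability for a new text.
labels, probs = model.predict("this is an example comment", k=1)
print(labels[0], probs[0])
```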

Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) networks, brought sequential modeling capabilities to text categorization. LSTMs mitigate the vanishing gradient problem of conventional RNNs and maintain long-term dependencies, which makes them viable for applications that require context comprehension, such as sentiment analysis or question answering. On the other hand, RNNs and LSTMs process text token by token, which increases computational cost and training time for large datasets or long sequences.

CNNs, developed primarily for image processing, have also proven effective for text classification. By treating sentences as one-dimensional sequences, convolutional filters extract local features such as n-grams. CNNs are therefore well suited to tasks where local context matters, such as short-text classification or sentiment analysis. They are computationally efficient and highly parallelizable, yet they may still struggle to encode long-range dependencies in text.

The most disruptive development in text classification came from Transformer-based models, especially BERT (Bidirectional Encoder Representations from Transformers). BERT introduced a new style of language-model pretraining built on the Transformer's self-attention architecture. Unlike previous models, BERT conditions on context from both directions, which yields a deeper understanding of relations within a sentence. Because it is pretrained on a huge corpus, BERT performs strongly on many NLP tasks, including text classification, sentiment analysis, and Named Entity Recognition (NER), and its capacity for transfer learning and fine-tuning has made it one of the most widely applied models. Nevertheless, its size and heavy computational requirements can be prohibitive for small organizations and resource-constrained environments.

While ML methods such as MNB and SVM offer computational simplicity and efficiency, they depend heavily on feature engineering and therefore tend to lose semantic and contextual information. In contrast, DL approaches, in particular transformer-based models like BERT, excel at capturing it but require substantial computational power and large amounts of training data. Accordingly, the chosen methodology is usually tailored to the task characteristics, such as dataset size, available compute, and text complexity. The combination of classical ML techniques and recent DL innovations has greatly extended the research horizons of text analysis, making text classification a dynamic and engaging area.

| Method | Advantages | Limitations | Best Use Cases |
|---|---|---|---|
| Multinomial Naive Bayes | Simple, fast, effective for small datasets | Independence assumption may not hold | Spam filtering, basic sentiment analysis |
| Support Vector Machines | Works well with high-dimensional data | Computationally intensive for large datasets | Review categorization, topic modeling |
| Random Forest | Handles noisy and imbalanced data | Limited interpretability, not optimal for sparse data | Biomedical text classification, domain-specific tasks |
| Logistic Regression | Easy to implement and interpret | Fails on non-linear relationships | Benchmarking, small-scale applications |
| FastText | Fast and scalable, robust to misspellings | Does not capture long-range dependencies | Low-resource or domain-specific tasks |
| LSTM | Captures sequential dependencies, suitable for long texts | Slow training, struggles with very long sequences | Sentiment analysis, question answering |
| textCNN | Efficient, captures n-gram features effectively | Fails to model long-term dependencies | News classification, short text classification |
| BERT and Variants | State-of-the-art accuracy, handles long texts with global context | High computational cost, requires large-scale pretraining | High-stakes tasks like medical or legal text classification |

Methodology

workflow.png

The entire process is structured as follows:

  1. Input text undergoes sensitive word matching and FastText screening.
  2. Texts flagged by both methods are classified as illegal.
  3. Conflicting cases are resolved by the BERT classification model.
  4. Non-illegal texts proceed to the high-quality comment filtering stage.
  5. Token embeddings are generated using pretrained BERT.
  6. An Auto-Encoder calculates reconstruction error to assess quality.
  7. Texts with acceptable reconstruction errors are deemed high-quality comments.

The proposed system consists of two main components: illegal text detection and high-quality comment filtering, designed to operate sequentially for robust content moderation. Each component is trained and evaluated independently, ensuring flexibility and scalability. Below, we describe the methodology for each stage in detail.

1. Illegal Text Detection

The first stage is illegal text detection, which acts as a filtering gate. Its goal is to identify and classify texts containing sensitive or illegal content. This hybrid stage combines rule-based matching and machine learning for better precision.

  • Sensitive Word Matching:
    The text is first checked by a rule-based sensitive-term matcher: a predefined lexicon of illegal or sensitive vocabulary is used to flag texts that contain these terms. This approach is computationally efficient and interpretable. However, it may produce false positives or miss texts that express the same content in different wording.

  • FastText Model Screening:
    In parallel, the input is fed to a FastText classifier. This lightweight deep learning model, developed by Facebook, averages word embeddings to represent the text and efficiently predicts the probability that a piece of text is illegal. Its subword information keeps predictions stable under spelling variations and slight modifications of the text.

  • Fusion of Results:
    To further increase reliability, the system fuses the outputs of sensitive word matching and FastText screening. The decision logic is as follows (see the sketch after this list):

    • If both methods flag the text as illegal, it is directly classified as illegal.
    • If both methods agree that the text is not illegal, it is passed to the next stage for high-quality filtering.
    • If the results conflict (e.g., one flags illegal and the other does not), a BERT-based classification model is invoked to make the final determination.
  • BERT Model for Conflict Resolution:
    The BERT (Bidirectional Encoder Representations from Transformers) model leverages its deep contextual understanding of the text to resolve conflicts. BERT generates a bidirectional representation of the input text and uses fine-tuning to classify whether the text is illegal or not. This step adds a layer of robustness, particularly for texts with ambiguous or context-dependent content.
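
A minimal sketch of this fusion logic is shown below. The sensitive-word set, the FastText label name, the probability threshold, and the bert_classifier callable are illustrative assumptions, not the project's actual configuration.

```python
ILLEGAL_TERMS = {"term_a", "term_b"}  # placeholder sensitive-word lexicon

def keyword_flag(text: str) -> bool:
    """Rule-based check: flag the text if it contains any sensitive term."""
    return any(term in text for term in ILLEGAL_TERMS)

def fasttext_flag(text: str, ft_model, threshold: float = 0.5) -> bool:
    """FastText screening: flag the text if P(illegal) exceeds the threshold."""
    labels, probs = ft_model.predict(text, k=1)
    return labels[0] == "__label__illegal" and probs[0] >= threshold

def detect_illegal(text: str, ft_model, bert_classifier) -> bool:
    """Fuse both signals; fall back to BERT only when they disagree."""
    kw, ft = keyword_flag(text), fasttext_flag(text, ft_model)
    if kw and ft:            # both flag the text: classify as illegal
        return True
    if not kw and not ft:    # both agree it is clean: pass to quality filtering
        return False
    return bert_classifier(text)  # conflict: fine-tuned BERT makes the final call
```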

2. High-Quality Comment Filtering

For texts classified as non-illegal by the previous stage, the system proceeds to evaluate their quality through the high-quality comment filtering process. This stage involves the use of a pretrained BERT model for token embedding generation and an Auto-Encoder for reconstruction-based quality assessment.

  • Token Embeddings Generation with BERT:
    The input text is first tokenized and converted into token embeddings using a pretrained BERT model. The [CLS] token, which serves as an aggregate representation of the entire text, is also included in the matrix. The output is a multi-dimensional matrix where each row corresponds to the embedding of a token in the input sequence. This representation captures both semantic and syntactic information from the text.
  • Auto-Encoder Reconstruction:
    The BERT-generated token embeddings are fed into a trained Auto-Encoder, a type of neural network used for unsupervised learning. The Auto-Encoder architecture consists of:
    • Encoder: Compresses the high-dimensional input embeddings into a latent-space representation, reducing noise and dimensionality.
    • Decoder: Reconstructs the input embeddings from the latent representation, aiming to minimize the reconstruction error.
  • Reconstruction Error Calculation:
    After processing the embeddings, the system calculates the reconstruction error between the original input embeddings and the reconstructed output. The reconstruction error serves as a measure of how well the text aligns with the characteristics of high-quality comments. The rationale is that high-quality texts are more likely to be accurately reconstructed due to their alignment with the training data patterns.
  • Quality Determination:
    The system applies a predefined threshold to the reconstruction error (a minimal sketch of this step follows this list):
    • If the error falls within the acceptable range, the text is classified as a high-quality comment.
    • If the error exceeds the threshold, the text is discarded, as it does not meet the quality standards.
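
A minimal sketch of this stage is given below, assuming a Hugging Face BERT checkpoint ("bert-base-uncased"), a trained Auto-Encoder, and mean-squared error as the reconstruction measure; all three choices are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    """Return the BERT token-embedding matrix (seq_len x hidden_size), [CLS] included."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)

def is_high_quality(text: str, autoencoder, threshold: float) -> bool:
    """Accept the comment when the mean reconstruction error stays under the threshold."""
    x = token_embeddings(text)
    with torch.no_grad():
        x_hat = autoencoder(x)
    error = torch.mean((x - x_hat) ** 2).item()  # reconstruction error (MSE, assumed)
    return error <= threshold
```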

This comprehensive two-stage pipeline ensures robust filtering of illegal content while simultaneously identifying high-quality comments for further use. By combining rule-based methods, traditional deep learning (FastText), and advanced transformer-based techniques (BERT), the system achieves a balance of efficiency, accuracy, and scalability.

Experimental Study and Result Analysis

Illegal Text Detection: Experimental Results and Model Selection

In our project, we conducted extensive experiments to evaluate the performance of Machine Learning (ML) and Deep Learning (DL) models for illegal text detection. The results demonstrate that Deep Learning methods significantly outperform traditional ML approaches, leading to the selection of a two-stage hybrid approach combining FastText and BERT. This strategy achieves a balance between computational efficiency and classification accuracy.

Experimental Results

The evaluation metrics for both ML and DL models, including Accuracy, Precision, Recall, and F1-Score, are presented in the following tables:
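
For reference, these metrics follow their standard definitions in terms of true/false positives and negatives (averaged over classes for multi-class results):

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```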

Machine Learning Models Performance

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| MultinomialNB | 0.5461 | 0.5427 | 0.5461 | 0.5441 |
| Random Forest | 0.5899 | 0.5905 | 0.5899 | 0.4942 |
| XGBoost | 0.5847 | 0.5672 | 0.5847 | 0.5165 |
| LightGBM | 0.5784 | 0.5548 | 0.5784 | 0.5086 |
| SVM | 0.5680 | 0.5491 | 0.5680 | 0.5423 |
| Logistic Regression | 0.5600 | 0.5413 | 0.5600 | 0.5375 |

Deep Learning Models Performance

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| FastText | 0.99 | 0.99 | 0.985 | 0.985 |
| BERT | 0.99 | 0.985 | 0.99 | 0.99 |
| LSTM | 0.99 | 0.99 | 0.985 | 0.99 |
| TextCNN | 0.98 | 0.985 | 0.985 | 0.985 |

From these results, we observe the following:

  • Deep Learning models like BERT, LSTM, and FastText achieve near-perfect accuracy and F1-scores (~0.99), far outperforming traditional ML methods.
  • Among ML methods, Random Forest achieves the highest accuracy (0.5899), but the overall F1-scores remain relatively low (<0.55), indicating limitations in capturing complex textual patterns.
  • While BERT achieves the highest performance, its computational cost makes it less efficient for large-scale text filtering.

Model Selection: Two-Stage Hybrid Approach

Based on these results, we adopted a two-stage hybrid model that combines FastText and BERT for illegal text detection, drawing on the strengths of both approaches.

  1. FastText for Initial Filtering:
    One prominent advantage of using FastText for initial filtering is that it processes text extremely fast, making it economical in both time and compute. It sorts large volumes of input text into "most likely illegal" and "most likely non-illegal" segments and handles large chunks of data at once, which lightens the computational load on the next model.

  2. BERT for Final Decision-Making:
    Texts that appear ambiguous or likely illegal are passed to BERT for further analysis. BERT's deep contextual understanding guarantees precise classification even in ambiguous or borderline cases.

Workflow of the Two-Stage Approach

  1. Clearly non-suspicious content is filtered out by FastText without further analysis.
  2. Segments that FastText flags as "probably illegal" or as ambiguous are examined in more detail by BERT.
  3. BERT's output determines the final classification, taking the full context into account.

This two-stage design offers several benefits:

  • Computational efficiency: most trivial or clearly benign texts are filtered out in the FastText stage, so the workload passed to BERT is significantly reduced.
  • High accuracy: BERT-based detection is highly accurate, minimizing both false positives and overlooked illegal content.
  • Scalability: the hybrid method suits business-critical real-time scenarios that require high throughput and low latency simultaneously.
  • A balanced trade-off: combining the speed of FastText with the robustness of BERT yields strong performance on large-scale text detection problems.

In sum, the FastText and BERT hybrid achieves an effective balance between computational efficiency and classification precision for real-world applications that demand both speed and accuracy.

High-Quality Comment Filtering: Reconstruction Error Analysis

This section details the Auto-Encoder architecture used for filtering high-quality comments and provides an analysis of its experimental results, including training performance and reconstruction error evaluation.

Model Architecture

The Auto-Encoder consists of two main components: an Encoder and a Decoder. The architecture is designed to compress and reconstruct input text embeddings generated by BERT. The detailed configuration is as follows:

  • Encoder:
    • Composed of 2 fully connected layers activated by ReLU.
    • Reduces the dimensionality of the input embeddings to a compressed 32-dimensional representation vector.
  • Decoder:
    • Symmetrically mirrors the encoder with 2 fully connected layers activated by ReLU.
    • Reconstructs the original input embeddings from the compressed vector.
  • Input Representation:
    Text data is tokenized and embedded into a matrix using BERT token embeddings, capturing both semantic and syntactic features. The Auto-Encoder then learns to reconstruct these input embeddings (a minimal architecture sketch follows this list).
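
A minimal PyTorch sketch of this architecture follows. The 32-dimensional latent vector and ReLU activations come from the description above; the 768-dimensional input (BERT-base hidden size) and 256-unit intermediate layer are assumptions, and the final decoder layer is left linear here so reconstructions can take negative values.

```python
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Fully connected Auto-Encoder over BERT token embeddings (illustrative sketch)."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        # Encoder: two fully connected layers with ReLU, compressing to a 32-dim vector.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim), nn.ReLU(),
        )
        # Decoder: mirrors the encoder and reconstructs the original embedding dimension.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

Training would then minimize the reconstruction error (e.g., mean-squared error) between the input embeddings and the decoder's output.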

Auto-encoder-loss.png

Training and Validation Loss: The training and validation loss curves confirm that the Auto-Encoder converges within 200 epochs. The losses drop sharply in the first epochs and settle around 0.1 by the end, indicating that the embeddings are learned effectively with a small final error.

  • Observations:
    • Training loss (blue) and validation loss (orange) decrease monotonically without significant overfitting, reflecting steady improvement in overall model performance.
    • The small gap between training and validation loss indicates strong generalization.

Auto-encoder-Distribution.png

Reconstruction Error Distribution: The distribution of reconstruction errors shows whether the system can properly separate high-quality comments from low-quality ones. The histogram highlights the following observations:

  • Reconstruction Error Threshold:
    A red dashed line at the 95th percentile of the reconstruction error distribution serves as the cutoff between normal and abnormal reconstructions. Comments with errors below the threshold are judged to be of higher quality, while those above it are considered defective (a snippet for deriving this cutoff follows this list).
  • Error Distribution:
    Most comments fall in the 0.50-0.55 range of reconstruction error, supporting the claim that the model reconstructs high-quality content well. In contrast, a small set of comments with reconstruction errors above 0.56 deviates from the learned latent representations and is judged low quality.
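
A short snippet for deriving such a cutoff, assuming the per-comment reconstruction errors have already been collected into an array:

```python
import numpy as np

def percentile_threshold(errors, q: float = 95.0) -> float:
    """Return the q-th percentile of the reconstruction errors as the quality cutoff."""
    return float(np.percentile(errors, q))

def filter_high_quality(comments, errors, threshold: float):
    """Keep the comments whose reconstruction error does not exceed the threshold."""
    return [c for c, e in zip(comments, errors) if e <= threshold]
```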

Future Work and Conclusion

Validation of Effectiveness

Since the current system operates with an unsupervised model for high-quality comment filtering, it is challenging to directly verify its effectiveness without labeled data. This limitation creates a gap in understanding how well the Auto-Encoder distinguishes between high- and low-quality comments in practical use cases.

  • Possible Solution:
    To address this, the system can be deployed in a real-world application environment where user interaction and behavior are monitored. User feedback can be collected to evaluate the system's accuracy and reliability. Metrics such as user satisfaction rates, content retention rates, and manual moderation corrections can serve as indicators of the model's effectiveness. Combining this feedback with semi-supervised fine-tuning on partially labeled datasets can further enhance validation.

Improving the Auto-Encoder Architecture

The Auto-Encoder currently uses only fully connected layers in both its encoder and decoder. While this design is simple, it may fail to capture all the complex relationships and patterns present in texts, especially long or intricate ones. The following improvements are proposed:

  1. Inclusion of Convolutional Layers:
    Integrating convolutional layers into the Auto-Encoder could strengthen its ability to capture local and positional relationships among the token representations. Because convolutions are effective at recognizing n-gram features and local dependencies in text, they could yield more accurate reconstruction and a better separation of low- and high-quality content.

  2. Utilizing Transformer-Based Architectures:
    The Auto-Encoder could be built on Transformer-based architectures such as BERT or GPT, drawing on recent NLP advances for fine-tuning. Through self-attention, Transformers capture global context and long-range dependencies across a sequence, enabling deeper feature extraction and closer semantic reasoning.

  3. Incorporating Variational Autoencoders (VAEs):
    A Variational Auto-Encoder (VAE) could generalize better and reveal latent structure in the data. Unlike traditional Auto-Encoders, VAEs impose a probabilistic distribution on the latent space, which lets the model capture textual variation more faithfully. This refinement would make the system more tolerant of noisy or ambiguous input while improving reconstruction quality (an illustrative sketch follows this list).

  4. Hybrid Models with Attention Mechanisms:
    Adding attention mechanisms to the Auto-Encoder would let the model weight important tokens or features more heavily than unimportant ones during reconstruction. Attention of this kind, as used in Transformers, focuses the model on the most relevant parts of the text and yields more accurate, context-aware reconstructions.
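
As a purely illustrative sketch of the VAE idea mentioned in item 3 (not part of the current system), the dimensions below mirror the earlier assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal variational Auto-Encoder over embedding vectors (future-work sketch)."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of the latent Gaussian
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance of the latent Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing through mu/logvar.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```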

Future Experiments and Evaluation

To validate the proposed improvements, the following steps can be undertaken:

  • Benchmarking Architectures: Compare the new Auto-Encoder architectures (convolutional Auto-Encoders, Transformer-based Auto-Encoders, and VAEs) against the existing one and evaluate their performance, using metrics such as reconstruction error, precision, recall, and F1-score.

  • Data Augmentation: Employ data augmentation to create varied versions of both high- and low-quality comments for training and testing. Techniques like back-translation, paraphrasing, and noise injection have proved effective in producing robust models (a simple noise-injection sketch follows this list).

  • Semi-Supervised Learning: Evaluate semi-supervised methods that use a small labeled dataset to guide the otherwise unsupervised learning. Pseudo-labeling and consistency regularization could be applied to improve the model's predictions.

  • Human-in-the-Loop Validation: Involve human moderators to assess the quality of the filtered comments. A continuous, repeatable feedback loop could then be used to tune the model and align it with real-world expectations.
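
As one hedged example of noise injection in the spirit of EDA (random word deletion and adjacent swaps), with arbitrary placeholder probabilities:

```python
import random

def noise_inject(text: str, p_delete: float = 0.1, n_swaps: int = 1, seed=None) -> str:
    """Produce a noisy variant of a comment via random word deletion and adjacent swaps."""
    rng = random.Random(seed)
    words = text.split()
    # Randomly drop words with probability p_delete, keeping at least one word.
    kept = [w for w in words if rng.random() > p_delete] or words[:1]
    # Swap a random pair of adjacent words n_swaps times.
    for _ in range(n_swaps):
        if len(kept) > 1:
            i = rng.randrange(len(kept) - 1)
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return " ".join(kept)

# Example: noise_inject("this product works really well", seed=0)
```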

In conclusion, these enhancements will strengthen the system's ability to detect illegal text and filter high-quality comments. Integrating modern architectures with real-world validation should remove the current limitations and extend the system's capabilities.

References

  1. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. https://arxiv.org/abs/1607.04606
  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://arxiv.org/abs/1810.04805
  3. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647
  4. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  5. Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.org/abs/1408.5882
  6. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018
  7. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
  8. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). https://arxiv.org/abs/1603.02754
  9. Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1312.6114
  10. Wei, J., & Zou, K. (2019). EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/abs/1901.11196