Natural Language Processing (NLP) refers to the set of mathematical models and computational algorithms capable of processing, understanding, and generating natural language in either written (documents) or spoken (audio speech) form. Popular examples of NLP applications include chatbots, voice assistants, automatic text translation, spam detection, and sentiment analysis.
Natural language, the language spoken by humans, is complex because many words are ambiguous in meaning and grammatical role. Additionally, spoken language is far less structured than written text, and accents and regional particularities affect how conversations are understood. These challenges have driven many advances in NLP, enabling computers to understand and generate text in a way similar to how humans do daily.
The exponential growth of text data produced every day on the internet, especially on social media, and the rising popularity of smart devices such as the Amazon Echo are some of the factors driving the increasing interest in NLP in recent years. Industries like retail, banking, healthcare, and transportation are examples of economic sectors benefiting from the fast and efficient processing of large amounts of text and audio.
I studied the key concepts, mathematical models and techniques, and applications of NLP. I also explored tools for developing NLP solutions and discussed the limitations and ethical challenges related to current NLP solutions. This enabled me to critically analyze the issues, problems, and potential opportunities pertaining to this rich research area.
An in-depth understanding of established techniques of NLP and its real-world applications.
Discuss some of the ethical issues and current challenges of NLP.
Comprehensive knowledge of the structure (morphological, syntactic, and semantic) of natural languages, with a special focus on English, together with techniques for processing these languages and extracting relevant features.
Establish a comprehensive and practical awareness of the techniques and metrics used for evaluating different NLP algorithms.
Provide a wide-ranging practical knowledge of the available libraries, programming languages, and platforms for the development of NLP solutions.
Topic
Natural Language Processing consists of the research and development of algorithms capable of understanding the structure and meaning of human language. Currently, NLP has a wide range of applications, such as Conversational Agents, Machine Translation and Sentiment Analysis.
One of the most recent applications of NLP is the detection of hate speech and fake news on social media. Hate speech is any content that targets a particular race, religion, or sexual orientation with the intention of threatening, abusing, or provoking a minority social group. Fake news, on the other hand, consists of false or distorted information about certain topics intended to mislead readers. In recent years, the increasing spread of both fake news and hate speech on social media has had severe impacts on many political and social sectors. Some examples of these issues are:
Fake news impacting the US elections (both 2016 and 2020 [3]) and the current COVID-19 vaccination campaigns (due to anti-vaccine movements in social media groups [4]).
Racist and fascist speech inciting violence against people of colour during the Black Lives Matter protests [1][2].
The benefits of hate speech and fake news detection in social media.
The main Natural Language Processing (NLP) solutions available for detecting both types of posts.
Your opinion on automatic hate speech/fake news detection. Do you agree, disagree, or are you neutral (depending on context, applications, etc.)? Explain your reasons.
References
[1] New hate crime record after racial attacks rocket during BLM protests (2021) The Independent.
[2] Kumar, S. and Pranesh, R.R. (2021) ‘TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter’, arXiv:2108.12521 [cs] [Preprint].
[3] Bovet, A. and Makse, H.A. (2019) ‘Influence of fake news in Twitter during the 2016 US presidential election’, Nature Communications, 10(1), p. 7. doi:10.1038/s41467-018-07761-2.
[4] Germani, F. and Biller-Andorno, N. (2021) ‘The anti-vaccination infodemic on social media: A behavioral analysis’, PLOS ONE, 16(3), p. e0247642. doi:10.1371/journal.pone.0247642.
Post
The potential positive outcomes of detecting hate speech and fake news on social media are wide-ranging and are essential for sustaining a decent online environment. To begin with, identifying and suppressing hate speech can make the digital environment friendlier and reinforce the principle that all groups in society are equal. Hate speech is known to escalate antagonistic racial, religious, or sexual tensions into more serious conflicts leading to violence and discrimination. Using NLP solutions to detect hate speech on social media can enable quick identification and removal of such content before it does significant damage.
In the same regard, fake news detection plays an important role in preventing the spread of misinformation, which can have major repercussions. At crucial moments such as elections or pandemics, false information can shift public attitudes, undermine confidence in institutions, and even threaten the lives of citizens (Beutel, Kirschler and Kokott, 2022). Significantly, NLP algorithms help discover patterns and linguistic clues that are indicative of fake news, allowing it to be quickly identified and removed from digital venues.
Multiple NLP solutions have been designed for the identification of hate speech and fake news. Parihar et al. (2021) discuss the applications and challenges of hate speech detection using NLP, highlighting the importance of using linguistic and contextual features for accurate recognition. Zhang and Ghorbani (2020), in their overview of online fake news, likewise argue that sophisticated approaches are needed for effective detection, since linguistic features, propagation mechanisms, and user behaviour must all be taken into consideration. These NLP solutions use machine learning algorithms and natural language processing techniques to examine textual information and detect the underlying patterns that characterise hate speech or fake news.
In my opinion, automatic hate speech and fake news detection based on NLP would allow us to combat these issues in a timely manner and at scale, but only with careful balancing. The sheer volume and speed of content distribution across social networks mean that manual moderation is simply not feasible, which argues in favour of automatic solutions. Nevertheless, the reliability of these systems is called into question by false positives and the bias that can come with algorithmic decision-making. Finding a balance between free speech and content removal is difficult, but improving the accuracy of algorithms, together with greater transparency about how they decide, can help address these problems.
Real-life cases further support the case for automated detection. The impact of misinformation is evident in how fake news influenced democratic processes such as the 2016 US presidential election (Bovet and Makse, 2019). In addition, anti-vaccination content on social media demonstrates that incorrect information creates safety hazards for individuals (Germani and Biller-Andorno, 2021). These cases underline the need for efficient NLP solutions to curb the social cost of hateful content and misinformation. In sum, detecting hate speech and fake news is a necessary measure for creating an environment for online interaction that does not threaten life and health.
References
Beutel, I., Kirschler, O. and Kokott, S. (2022). How do fake news and hate speech affect political discussion and target persons and how can they be detected? Central and Eastern European eDem and eGov Days, 342, pp.37–81. https://doi.org/10.24989/ocg.v.342.2.
Bovet, A. and Makse, H.A. (2019). Influence of fake news in Twitter during the 2016 US presidential election. Nature Communications, 10(1). https://doi.org/10.1038/s41467-018-07761-2.
Germani, F. and Biller-Andorno, N. (2021). The anti-vaccination infodemic on social media: A behavioral analysis. PLOS ONE, 16(3). https://doi.org/10.1371/journal.pone.0247642.
Parihar, A., Thapa, S. and Mishra, S. (2021). Hate Speech Detection Using Natural Language Processing: Applications and Challenges. 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI). https://doi.org/10.1109/ICOEI51242.2021.9452882.
Zhang, X. and Ghorbani, A.A. (2020). An overview of online fake news: Characterization, detection, and discussion. Information Processing & Management, 57(2). https://doi.org/10.1016/j.ipm.2019.03.004.
Summary
Reflecting on the peer and instructor conversations about the post, several important points came to light. One theme that was repeatedly raised was appreciation of the contribution that hate speech and fake news detection makes to building a more sustainable digital ecosystem. There was also consensus on the difficulty of achieving both precision and fairness in detection. Participants emphasised the need to consider cultural and linguistic subtleties, especially for minority groups, so that NLP algorithms do not unwittingly reinforce existing biases. Moreover, the discussions called for the continuous refinement and validation of detection models to address false positives and to guarantee effective content moderation without violating freedom of expression.
Another debate concerned striking a balance between free speech and content moderation. Although automatic detection systems are scalable and efficient, concerns about false positives and algorithmic bias emerged. Peers stressed refining NLP algorithms to limit false positives and to reduce biases that may impact minority groups disproportionately. Most contributions also indicated the importance of continuously monitoring and updating detection systems to keep up with the changing techniques used by illicit actors (Parihar et al., 2021). Furthermore, real-world instances, such as the effect of misinformation on democratic norms and health, reinforced the need for sturdy NLP solutions. I gained a more comprehensive picture of the multi-layered nature of hate speech and fake news detection, paying particular attention to the balance between technical capability and ethical considerations.
Moving forward, the spotlight should be directed towards research on hate speech and fake news detection in social media that aims to eliminate bias and trace the sources of hate. This involves finding new ways to incorporate diverse linguistic and cultural perspectives into detection models, resulting in greater inclusivity and better accuracy (Zhang & Ghorbani, 2020). Moreover, researchers should focus on building open and transparent algorithms to ensure accountability and establish the trust of users and stakeholders.
In addition to algorithmic solutions, the fight against hate speech and fake news should also include societal measures such as media literacy programmes and community-based initiatives that foster critical thinking and responsible online behaviour (Beutel et al., 2022). By addressing the root sources of misinformation and intolerance, we can build a resilient digital world grounded in democratic values that safeguards the interests of its users. In the long run, social media can be tactfully and systematically steered towards productive dialogue and eventual social reform.
References
Beutel, I., Kirschler, O., & Kokott, S. (2022). How do fake news and hate speech affect political discussion and target persons and how can they be detected? Central and Eastern European EDem and EGov Days, 342, 37–81. https://doi.org/10.24989/ocg.v.342.2
Parihar, A., Thapa, S., & Mishra, S. (2021). Hate Speech Detection Using Natural Language Processing: Applications and Challenges. 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI). https://doi.org/10.1109/ICOEI51242.2021.9452882
Zhang, X., & Ghorbani, A. A. (2020). An overview of online fake news: Characterization, detection, and discussion. Information Processing & Management, 57(2). https://doi.org/10.1016/j.ipm.2019.03.004
Kaggle is one of the largest online communities for data scientists and machine learning practitioners. It enables users to find and publish datasets, explore and build Deep Learning models, and collaborate with other data scientists and ML engineers around the world. One of its more interesting features is the Machine Learning Competitions, which attract over a thousand teams and individuals every year. These competitions consist of problems posted by companies and/or research institutions in which teams compete to build the best algorithm.
Among the most popular Kaggle competitions is the Toxic Comment Classification Challenge. This challenge has been held annually since 2018 and consists of researching and developing methods for detecting and classifying different levels of toxicity in negative and disrespectful online comments, from insults to violent threats. In this assessment, we'll be using the dataset from the first competition to analyse the performance of different Machine Learning methods for classifying six types of toxicity. Details on the dataset and instructions on how to use it are available here: Toxic Comment Classification Challenge.
Tasks
Perform detailed data analysis of the dataset provided by the competition, observing:
Number of sentences and tokens per class (and check if the dataset is unbalanced or not).
Analyse the most common words for each class and, therefore, understand the most used terms for each level of toxicity.
Select three Machine Learning algorithms among the ones listed below (many of these methods were explored in this module and previous ones):
Support Vector Machine (SVM)
K-Nearest Neighbours (KNN)
Naïve Bayes
Decision Trees
Logistic Regression
Random Forest
Multi-Layer Perceptron
Analyse their performance in classifying the level of toxicity of different comments. Please make use of the main metrics (accuracy, F1-score, recall, precision, and AUC) to compare the different algorithms. Additionally, clearly explain the parameters defined for each model. Any Python ML library can be used during implementation (such as sklearn and keras).
Consider the main Feature Extraction methods studied in previous lectures, such as TF-IDF and Word Embeddings. Using the same three classifiers previously analysed, change the Feature Extraction method initially used (for example, if you used Word Embeddings, change to TF-IDF), repeat the previous experiments, and observe whether there is any considerable difference between the new results and the previous ones (i.e. whether the method of feature extraction impacts classification performance). A sketch of one way to run this comparison follows this list.
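The following is a minimal sketch of one way to run this comparison with sklearn, assuming the competition's train.csv layout (a comment_text column plus six binary label columns); the extractor settings and classifier parameters are illustrative choices, not values prescribed by the brief:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
df = pd.read_csv("train.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["comment_text"], df[LABELS], test_size=0.2, random_state=42)

extractors = {"tfidf": TfidfVectorizer(max_features=20000),
              "counts": CountVectorizer(max_features=20000)}
classifiers = {"logreg": LogisticRegression(max_iter=1000),
               "svm": LinearSVC(),
               "nb": MultinomialNB()}

for ex_name, extractor in extractors.items():
    Xtr = extractor.fit_transform(X_train)  # fit the vocabulary on training data only
    Xte = extractor.transform(X_test)       # reuse it on the test split to avoid leakage
    for clf_name, clf in classifiers.items():
        for label in LABELS:                # one binary problem per toxicity type
            clf.fit(Xtr, y_train[label])
            pred = clf.predict(Xte)
            print(f"{ex_name}/{clf_name}/{label}: "
                  f"acc={accuracy_score(y_test[label], pred):.3f} "
                  f"prec={precision_score(y_test[label], pred, zero_division=0):.3f} "
                  f"rec={recall_score(y_test[label], pred, zero_division=0):.3f} "
                  f"f1={f1_score(y_test[label], pred, zero_division=0):.3f}")
            # AUC additionally needs scores (predict_proba or decision_function)
```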
The analysis of the toxic comment data reveals insightful patterns and problems in detecting and classifying toxic online behavior. The initial data analysis revealed differences in word counts and average sentence counts across the toxicity classes, with 'severe_toxic' remarks being longer, hinting at a probable association between a comment's verbosity and its toxicity level. The class distribution also highlighted a substantial imbalance, with 'toxic' comments being the most prevalent. This imbalance posed a problem for machine learning (ML) models, possibly impacting their performance and requiring rebalancing approaches (Risch and Krestel, 2020; Dablain et al., 2023). The analysis also evaluated the performance of three ML algorithms, Logistic Regression, Support Vector Machine, and Naïve Bayes, using TF-IDF feature extraction.
Number of Sentences and Tokens per Class
This section employed nltk to tokenize the words and sentences in the comment_text column. The number of sentences and tokens for every class was then calculated (Figure 1).
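A short sketch of this counting step is shown below, assuming the standard train.csv layout; the column names and the per-class averaging are illustrative rather than the exact code used:

```python
import pandas as pd
import nltk
nltk.download("punkt", quiet=True)

df = pd.read_csv("train.csv")
df["n_sentences"] = df["comment_text"].apply(lambda t: len(nltk.sent_tokenize(t)))
df["n_tokens"] = df["comment_text"].apply(lambda t: len(nltk.word_tokenize(t)))

for label in ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]:
    flagged = df[df[label] == 1]  # comments positive for this toxicity class
    print(label, flagged["n_sentences"].mean().round(2), flagged["n_tokens"].mean().round(2))
```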
Figure 1 The Average Sentence and Token (Word) Counts for Comments in Various Categories
The figure above shows the average sentence and token (word) counts for comments classified under different toxicity categories. Comments labeled as 'severe_toxic' have notably higher averages, with approximately 7.51 sentences and 101.02 words, suggesting they tend to be longer and possibly more elaborate. In contrast, the 'toxic,' 'obscene,' 'threat,' 'insult,' and 'identity_hate' categories have fewer sentences and words on average, showing shorter comments. Interestingly, non-toxic comments (where the class is '0') generally have more words than the toxic ones, except in the 'threat' category, where the word count is slightly higher for toxic comments. These patterns might mean that severely toxic comments are wordier, while other kinds of toxicity are expressed more concisely. It is also observable that sentence counts do not differ as widely across categories as word counts, which might indicate similar sentence complexity across toxicity levels.
Figure 2 The class distribution of entries in the dataset
The bar chart in Figure 2 shows the class distribution of entries in the dataset, revealing a significant imbalance. The toxic class has the greatest number of entries, exceeding all other categories, followed by the obscene and insult classes with a moderate presence. In contrast, the threat, identity_hate, and severe_toxic classes have relatively few entries, with "threat" being the least represented. This imbalance indicates that the dataset predominantly comprises comments labeled as 'toxic', which could influence model performance and necessitate rebalancing techniques for effective model training and validation.
Three ML algorithms were selected: Logistic Regression, Support Vector Machine, and Naïve Bayes. However, only the Logistic Regression model completed training successfully. Its performance is captured below.
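A hedged sketch of how such a Logistic Regression model could be set up for the six labels follows; the report does not state the exact settings used, so the class_weight choice (one way to counter the imbalance noted earlier) and the feature limit are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# X_train / y_train come from the earlier split; y_train holds six 0/1 columns.
model = make_pipeline(
    TfidfVectorizer(max_features=20000),
    OneVsRestClassifier(                       # one binary classifier per label
        LogisticRegression(max_iter=1000, class_weight="balanced")),
)
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)    # per-label scores, usable for AUC
```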
Figure 3 Logistic Regression Performance
The performance metrics of the Logistic Regression model were similar before and after applying TF-IDF, with only slight variations. Accuracy dropped marginally by 0.0013, and the F1-score saw a minor drop of 0.0045 (Figure 4), indicating a slight decline in the balance between precision and recall. Recall improved slightly by 0.0013, suggesting a marginal rise in the model's ability to find all the relevant instances. Precision saw a more significant decrease of 0.0207, indicating that the model's predictions became less precise after TF-IDF (Liu et al., 2018; Zhuohao et al., 2021). The AUC decreased by 0.0013, which is negligible and suggests the model's discriminative ability is almost unchanged.
Figure 4 Logistic Regression Performance after applying TF-IDF
The feature extraction method change to TF-IDF did not significantly impact the model's performance, suggesting that the Logistic Regression model is relatively robust to these changes in feature extraction methodology.
References
Dablain, D., Jacobson, K.N., Bellinger, C. and Chawla, N.V. (2023). Understanding CNN fragility when learning with imbalanced data. Machine Learning. https://doi.org/10.1007/s10994-023-063269
Liu, Q., Wang, J., Zhang, D., Yang, Y. and Wang, N. (2018). Text features extraction based on TF-IDF associating semantic. 2018 IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, pp. 2338–2343. doi: 10.1109/CompComm.2018.8780663.
Risch, J. and Krestel, R. (2020). Toxic comment detection in online discussions. In: Agarwal, B., Nayak, R., Mittal, N. and Patnaik, S. (eds) Deep-learning-based approaches for sentiment analysis. Springer, pp. 85–109.
Zhuohao, W., Dong, W. and Qing, L. (2021). Keyword Extraction from Scientific Research Projects Based on SRP-TF-IDF. Chinese Journal of Electronics, 30(4), pp. 652–657. https://doi.org/10.1049/cje.2021.05.007
Recurrent Neural Networks (RNNs) are one of the most useful neural network architectures for solving various NLP problems.
Compare and contrast the application of different RNN architectures for solving various NLP problems.
Compare and contrast supervised and unsupervised machine learning methods in general, whilst making specific reference to at least two ML modelling techniques in each of the two broad classes. Your comparison should highlight strengths, weaknesses, and challenges for each of the ML modelling techniques mentioned in your post.
Draw up structured criteria that can be used in determining the most appropriate ML approach for developing a machine learning solution to a particular problem.
Post
What I truly appreciate is how clearly you have explained the one-to-many and many-to-one RNN architectures through your examples of image captioning and sentiment classification. These two examples showcase the ability of RNNs to process sequential data and produce meaningful output. LSTM-based image captioning is viable because the technique addresses the problem of generating captions of variable length, which arises from the complexity of the input image (Smagulova and James, 2019).
Indeed, you were quite right that RNNs come with challenges such as vanishing and exploding gradients. A vanishing gradient can severely impair the model, as your image captioning example confirms, limiting its ability to accurately define and explain the content of the image (Dadoun and Troncy, 2020). The link you draw between sentiment categorization and grammatical faults arising from exploding gradients clearly reveals the need to pay attention to optimal parameter selection and gradient-related problems in NLP systems. It is important for practitioners to be aware of these pitfalls and of techniques like gradient clipping for preventing them.
Besides the challenges you mentioned, the training process of RNNs also entails high computational cost. Training RNNs is a computationally intensive procedure, particularly with vast datasets, and balancing model complexity against computational efficiency is a persistent issue (Gori et al., 2009). Overall, this post gives a detailed account of how the major types of RNN architectures are put to use in the NLP domain and of the commonly encountered problems. It can serve as a foundation for dialogue about these problems and can help enhance the robustness of RNN systems in practical applications.
References
Dadoun, A. and Troncy, R. (2020). Many-to-one Recurrent Neural Network for Session-based Recommendation. arXiv. https://doi.org/10.48550/arXiv.2008.11136.
Gori, M., Hammer, B., Hitzler, P. and Palm, G. (2009). Perspectives and challenges for recurrent neural network training. Logic Journal of the IGPL, 18(5), pp.617–619. https://doi.org/10.1093/jigpal/jzp042.
Smagulova, K. and James, A.P. (2019). A survey on LSTM memristive neural network architectures and applications. The European Physical Journal Special Topics, 228(10), pp.2313–2324. https://doi.org/10.1140/epjst/e2019-900046-x.
Summary
The discussion concentrated mainly on the one-to-many and many-to-one RNN models, and it yielded valuable insights that have sharpened and broadened my knowledge of their use cases.
The first conclusion I drew from the discussions was that the choice of RNN architecture should depend on the NLP task. The many-to-one RNN model that I identified early on is widely regarded as the best choice for sentiment analysis. Peers specifically drew attention to its ability to process a sequence of words and output a single sentiment value, especially for movie review sentiment classification (Tembhurne & Diwan, 2020). The class conversations also brought to light the intricacies associated with the many-to-one model, including the vanishing gradient problem. The addition of Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells, as a method to handle long-term dependencies and control information flow, was stressed as a significant way to overcome these challenges (Liu et al., 2024).
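As a minimal illustration of this many-to-one pattern, the PyTorch sketch below maps a whole token sequence to a single sentiment label; the vocabulary size, dimensions, and class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                 # (batch, seq_len) of word indices
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # keep only the final hidden state
        return self.fc(hidden[-1])                # one sentiment prediction per sequence

logits = SentimentLSTM()(torch.randint(0, 10000, (4, 20)))  # e.g. 4 reviews of 20 tokens
```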
On the other hand, the one-to-many RNN architecture, focused on tasks such as language generation and text summarisation, attracted plenty of attention. Here the example of image captioning, in which the model generates a textual description of an image, was highlighted as a particularly important application of this architecture (Hu et al., 2022). However, the stumbling block of the exploding gradient remains relevant for one-to-many RNNs and is observed mostly during the initial phase of training. Techniques such as gradient clipping, prudent parameter initialisation, and regularisation methods such as dropout were suggested to tackle these problems. Sophisticated optimisation algorithms such as RMSProp or Adam not only smooth the training but also provide stability.
Alongside that, the discussions allowed me to gain more knowledge of other strategies for strengthening RNN models in NLP tasks. Techniques such as group normalisation and gradient clipping were suggested to help achieve training stability, robust gradient computation, and smooth convergence (S et al., 2021). The underlying idea that parameter initialisation, regularisation, and architecture modification can together lead to success in RNN applications emphasised the collaborative approach to problem-solving.
Reviewing the class discussions convinced me of the essential role of RNNs in NLP while also giving me a deeper comprehension of the complexities of the different architectures. The insights I gathered made me revisit my strategy, showing me the necessity of being selective when choosing RNN models and of using diverse techniques to address particular problems. The versatile nature of the group discussions, where case studies and diverse opinions were encouraged, made the learning experience richer and my comprehension of RNN applications in NLP broader.
References
Hu, H., Zhu, X., Zhou, F., Wu, W., Hu, R. Q., & Zhu, H. (2022). One-to-Many Semantic Communication Systems: Design, Implementation, Performance Evaluation. IEEE Communications Letters, 1–1. https://doi.org/10.1109/lcomm.2022.3203984
Liu, X., Li, Z., Tang, Z., Zhang, X., & Wang, H. (2024). Application of Artificial Intelligence Technology in Electromechanical Information Security Situation Awareness System. Scalable Computing: Practice and Experience, 25(1), 127–136. https://doi.org/10.12694/scpe.v25i1.2280
Tembhurne, J. V., & Diwan, T. (2020). Sentiment analysis in textual, visual and multimodal inputs using recurrent neural networks. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-020-10037-x
Implement a seq-2-seq model with and without attention mechanism for developing a generative chatbot using one of the datasets listed below.
Datasets to be utilized for developing chatbots
Question Answer Datasets
The WikiQA Corpus was made publicly available in 2015 and has been updated several times since its inception. It contains sets of question and sentence pairs that were collected and annotated for research on open-domain question answering.
Question-Answer Dataset. This chatbot dataset was designed for use in Academic research, and features Wikipedia articles alongside manually-generated factoid questions that come from them. It also features manually-generated answers to the aforementioned questions.
Customer Support Datasets
Ubuntu Dialogue Corpus: Consisting of almost one million two-person conversations taken from the Ubuntu chat logs, this dataset is well suited for training a chatbot. It contains 930,000 dialogues spanning 100,000,000 words.
Tasks
Develop a chatbot based on the following models using one of the chosen datasets.
Seq-2-seq model without attention
Seq-2-seq model with attention – Can use either Bahdanau or Luong Attention mechanism in the project
Carry out necessary pre-processing tasks to prepare the data
Split the dataset into appropriate Train/Validation/Test sets
Use the Validation set to evaluate the model during training
Evaluate model on Test set
Assess the performance of the two models in terms of their accuracy or BLEU score (a sketch of the data split and BLEU scoring appears after this list)
Manually evaluate (on a smaller subset) the answers generated by the chatbot
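One way the split-and-score steps could look in Python is sketched below; the 80/10/10 ratio, the pairs variable, the smoothing choice, and the chatbot_reply function are assumptions for illustration, not part of the brief:

```python
from sklearn.model_selection import train_test_split
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# pairs: list of (question, answer) string tuples from the chosen dataset
train, rest = train_test_split(pairs, test_size=0.2, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)  # 80/10/10 overall

smooth = SmoothingFunction().method1  # avoids zero scores on very short answers

def bleu(reference, candidate):
    return sentence_bleu([reference.split()], candidate.split(), smoothing_function=smooth)

# chatbot_reply(question) -> generated answer string from the trained model
avg_bleu = sum(bleu(a, chatbot_reply(q)) for q, a in test) / len(test)
```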
This report presents a chatbot built using a sequence-to-sequence long short-term memory (LSTM) model trained on WikiQA question-answer data (Download Microsoft Research WikiQA Corpus from Official Microsoft Download Center, no date). The report covers preprocessing, model design without attention (WoA) and with attention (WA), and a results and discussion section evaluating the models using the Bilingual Evaluation Understudy (BLEU) metric and a manual assessment of information retrieval.
Figure 1. Analysis flowchart
WikiQA.tsv was chosen as it was the original dataset, allowing for a ground-up analysis that mitigates issues present in the other datasets. The analysis focused on two key features: question and sentence (called answer). The former was the predictor and the latter the target. The findings show a directional one-to-many relationship (Q→A), as questions can have many answers. However, the data presents it as a one-to-one relationship due to the repetition of questions.
The Q→A count has a right-tailed distribution (Figure 7 and Table 3 in the Appendix), and removing outliers could help the generalisability of the model if the test data is centred around the median.
Table 1 – The WikiQA features and counts.
Word and sentence analysis of the key features showed a right-tailed distribution with outliers above the upper limit (Figures 8, 9 and 10 in the appendix). This distribution could make the model adept at handling shorter data sequences, and removing outliers might enhance this characteristic, but doing so might hurt its performance on longer ones. Hence, the decision was made not to remove these outliers.
The Q→A relationships were further evaluated to identify non-distinct words and phrases that cross question, answer, and question-answer pairs, providing little discriminative value. The process involved grouping on DocumentTitle, as this feature can cover many related Q→A pairs, allowing for intra-dependencies on related topics and thus strengthening intra-topic relationships by weakening inter-topic ones, reducing noise and promoting appropriate patterns. The proportion of non-distinct words and phrases that cross groupings is tiny, but they may subtly aid the model in distinguishing Q→A.
Figure 2. Key relationships.
Figure 3. Top 5 words in Q→A relating to DocumentTitle. Common phrases are graphed in figure 11 of the appendix.
The pipeline is designed with optional steps to implement the analytical findings gradually. It focuses on mitigating data leakage, promoting relevant associations, and cleaning the data as per the advice detailed in Silva (2023) and Kommareddy (2022). Hence, text was lemmatised, and stop words, punctuation, and non-word characters were removed. However, numerics were kept, as Q→A pairs contained dates that may impart value. The pipeline's optional steps promote careful implementation, mitigating over-processing and maintaining data integrity.
Figure 4. Preprocessing pipeline.
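A minimal sketch of these optional cleaning steps, using NLTK, is shown below; the actual pipeline in the report may differ in tooling and step order:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text, remove_stopwords=True):
    # lower-case, drop punctuation and non-word characters, but keep numerics (dates)
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split()]
    if remove_stopwords:  # optional step, mirroring the pipeline's gradual design
        tokens = [tok for tok in tokens if tok not in STOP]
    return " ".join(tokens)
```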
Figure 5. WoA model.
Figure 6. WA model.
A featurisation method was used to extract text data with the Term Frequency-Inverse Document Frequency (TF-IDF) method (figure 12 of the appendix). It learns the vocabulary dictionary of all tokens in the raw documents and creates word_to_index and index_to_word pickle files for mapping purposes. The generated document-term matrices are then returned. The function convert_question_to_tensor tokenises the input from the users, converting these tokens to their corresponding indices, which are converted to a tensor (figure 12 of the appendix). Additionally, convert_predictions_to_response gets the indices of the words with the highest probability for each prediction and converts them to their corresponding words, joining the words to generate a response (figure 13 of the appendix).
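A hedged reconstruction of these helpers follows; the report's actual implementations appear in the appendix figures, so details here (such as greedy argmax decoding and the corpus variable) are assumptions:

```python
import pickle
import torch
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
doc_term_matrix = vectorizer.fit_transform(corpus)  # corpus: cleaned question/answer strings
word_to_index = vectorizer.vocabulary_
index_to_word = {i: w for w, i in word_to_index.items()}
pickle.dump(word_to_index, open("word_to_index.pkl", "wb"))  # saved for mapping purposes
pickle.dump(index_to_word, open("index_to_word.pkl", "wb"))

def convert_question_to_tensor(question):
    idxs = [word_to_index[tok] for tok in question.lower().split() if tok in word_to_index]
    return torch.tensor(idxs).unsqueeze(0)  # shape (1, seq_len)

def convert_predictions_to_response(predictions):
    # predictions: (seq_len, vocab_size) scores; take the most probable word per step
    idxs = predictions.argmax(dim=-1).tolist()
    return " ".join(index_to_word[i] for i in idxs)
```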
The encoder, composed of LSTM units, takes a sequence of word indices as input and embeds them into a continuous space (Dhote, 2020). The encoder processes the sequence and outputs a final hidden state, capturing the semantic information of the input sequence in a fixed-length vector (Moses, 2021). A decoder then uses this hidden state to predict an output sequence (Moses, 2021). The dropout layer prevents overfitting during training (Srivastava et al., 2014). The init_hidden method is used to initialise the hidden state of the LSTM at the start of each sequence.
The decoder, also composed of LSTM units, takes the final state of the encoder as its initial state, embedding the word representation in a continuous space (Moses, 2021). The decoder processes the sequence and outputs a sequence of hidden states. Each hidden state is then passed through a linear layer to produce a distribution over the output vocabulary for each word in the output sequence. The dropout layer prevents overfitting during training (Srivastava et al., 2014).
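The encoder and decoder described above might look as follows in PyTorch; the layer sizes and dropout rate are illustrative, not the report's values:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, dropout=0.2):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)  # prevents overfitting during training

    def init_hidden(self, batch_size):
        zeros = torch.zeros(1, batch_size, self.hidden_dim)
        return (zeros, zeros.clone())  # fresh (h0, c0) at the start of each sequence

    def forward(self, token_ids):
        embedded = self.dropout(self.embedding(token_ids))
        outputs, state = self.lstm(embedded)
        return outputs, state  # final state summarises the input in a fixed-length vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim, vocab_size)  # distribution over the vocabulary

    def forward(self, token_ids, state):  # state: the encoder's final (h, c)
        embedded = self.dropout(self.embedding(token_ids))
        outputs, state = self.lstm(embedded, state)
        return self.out(outputs), state
```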
The attention mechanism allows the model to focus on different parts of the input sequence for each step of the output sequence, which can be especially useful when the sentence is long (Maji, 2021). The forward method takes in the inputs, the hidden state, and the encoder outputs, and returns the output after applying the attention mechanism and passing through the LSTM.
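A minimal sketch of such an attention step is shown below, using a Luong-style dot-product score; the report does not specify which variant was actually implemented:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, 1, hidden); encoder_outputs: (batch, src_len, hidden)
        scores = torch.bmm(decoder_hidden, encoder_outputs.transpose(1, 2))  # (batch, 1, src_len)
        weights = torch.softmax(scores, dim=-1)        # attend over the input positions
        context = torch.bmm(weights, encoder_outputs)  # (batch, 1, hidden) weighted summary
        return context, weights
```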
Gradient clipping was implemented to prevent exploding gradients by thresholding them to a maximum value, stabilising training and improving the network's performance. Early stopping, a form of regularisation applied to avoid overfitting, stops training if the average loss is not improving (Figure 16 in Appendix). The learning rate scheduler reduces the learning rate after a predetermined number of steps. The gradient accumulation technique was used to handle cases where large models cannot fit into memory: it computes the gradients over several batches and then performs the optimiser update (Figure 17 in Appendix). The components implemented allowed the model to train more effectively, even with limited memory. However, these improvements may require more training steps, as the model updates less often.
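A hedged sketch of how these four safeguards can be combined in one training loop follows; model, criterion, and train_loader are assumed to exist already, and the thresholds, patience, and step sizes are illustrative rather than the report's settings:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
best_loss, patience, bad_epochs, accum_steps = float("inf"), 3, 0, 4

for epoch in range(50):
    total = 0.0
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(train_loader):
        loss = criterion(model(inputs), targets) / accum_steps  # spread over accumulated batches
        loss.backward()                                          # gradients accumulate across batches
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
            optimizer.step()
            optimizer.zero_grad()
        total += loss.item() * accum_steps
    scheduler.step()                          # reduce the learning rate on schedule
    avg = total / len(train_loader)
    if avg < best_loss:
        best_loss, bad_epochs = avg, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # early stopping when loss stops improving
            break
```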
Two models were built, a WoA and a WA, and the evaluation metric used was BLEU, which compares the generated answer from the model against the reference answer for a given question using a score between zero and one as a rating (Mulani, 2021). Zero indicates the model performed poorly, while one is a perfect score.
Figure 7. Loss over epoch.
Figure 8. BLEU over epoch.
The WoA model's loss drops to zero after a single epoch, which would normally suggest that its predicted output is close to the ground truth (Gomede, 2023). However, the sudden drop from 2.5 to zero is strange and could suggest overfitting (Holbrook R, 2023). The BLEU results support this assumption: the output answers from the model do not overlap with the reference answers, suggesting that learning was perturbed.
Unfortunately, results for the WA model could not be attained. This was due to a persistent matrix error associated with the tensor input into the softmax function.
Despite the implementation of components designed to enhance the efficiency and performance of model training, numerous challenges were encountered. Notably, the absence of a data preprocessing step to incorporate start and end-of-sentence tags was a significant oversight. Furthermore, we consistently encountered issues related to dimension mismatches during the computation of attention weights and the execution of vector and matrix transformations (Figure 18 in Appendix). Moreover, due to the constraints of computational resources, only a fraction of the available data was utilised to train and evaluate the models.
The chatbot's ability to respond accurately to user questions relies on a combination of trained models and dictionaries. However, challenges related to data dimensions impacted the performance of generating responses. The Bot Framework SDK (JonathanFingold, 2022) provides a modular and extensible software development toolkit (SDK) for building bots. Owing to the mentioned challenges, the chatbot was restricted to employing a sequence-to-sequence model without attention mechanisms. This resulted in poor performance, which can be attributed to the insufficient data provided during the training pipeline.
By including start and end-of-sentence tags in each question and answer entry, the tokens would have informed the model of the association between each question and answer entry, creating a distinction between respective questions and answers, preventing blending between entries and enhancing performance (Maji, 2020).
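As a brief illustration of this missing step (the tag strings and list names are conventions for the sketch, not the report's code):

```python
# Wrap each entry in start/end tokens before tokenisation, so the model can
# learn where an individual question or answer begins and ends.
questions = ["<sos> " + q + " <eos>" for q in questions]
answers = ["<sos> " + a + " <eos>" for a in answers]
```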
Assessing the problematic computation of attention weights and vector and matrix transformations should rectify the WA model's issues.
Given the implementation of these resolutions, the chatbot would then be evaluated manually for information retrieval, which is the ability of models to fetch relevant information in response to user inputs (Peras, 2018). This evaluation can be based on the quartile ranges found in the analysis (Tables 6, 7 and 8).
Though the models did not perform as intended, they provide a solid foundation for refinement. Once the missing end-of-sentence tags are included in the preprocessing pipeline and the issue with the matrix transformation for the WA model is corrected, there is confidence that the chatbot will perform as intended.
Kommareddy, N. J. (2022) 'Data Preprocessing Framework for Supervised Machine Learning', International Journal of Research in Engineering and Science (IJRES), 10(11), pp. 304-308.
Kotsiantis, S. B., Kanellopoulos, D. and Pintelas, P. E. (2006) 'Data preprocessing for supervised leaning', International journal of computer science, 1(2), pp. 111-117.
Sojasingarayar, A. (2020) 'Seq2seq ai chatbot with attention mechanism', arXiv preprint arXiv:2006.02767.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014) 'Dropout: a simple way to prevent neural networks from overfitting', The Journal of Machine Learning Research, 15(1), pp. 1929-1958.
Table 2 – The datasets. WikiQA.tsv was chosen.
Table 3 – The WikiQA features and counts.
Figure 9. A series of graphs showing that the QA distribution is right-tailed and the impact of removing outliers.
Table 4 – Total number of words and sentences.
Table 5 – Quartile ranges for word and answer counts.
Figure 10. A series of graphs showing word and sentence counts for Questions.
Figure 11. A series of graphs showing that words and sentence counts are right-tailed for answers.
Figure 12. A series of graphs showing that words and sentence counts are right-tailed given Q→A.
Figure 13. Common phrases in Q→A concerning DocumentTitle.
Figure 14. TF-IDF vectoriser implementation
Figure 15. Used to generate a response for the user.
Figure 16. Early stopping
Figure 17. Gradient clippings
Figure 18. TF-IDF.
Figure 19. Embedding dimensions in red.
Figure 20. Questions to tensor.
Table 6 – Question words and sentence counts.
Table 7 – Answers words and sentence counts.
Table 8 – Q→A representative. Assessing no attention against with attention.