Machine Learning (ML) has attained greater academic and commercial significance in recent years. As organizations seek to extract actionable information and greater insights from data, the ability offered by Machine Learning to enhance and automate such processes has gained traction.
The commercial promise of Machine Learning has driven the emergence of several key concepts, methods and practitioner toolkits. Whilst the initial drivers of Machine Learning stemmed largely from artificial intelligence research and developments in visualization technology, the paradigm continues to shift inexorably towards greater levels of complexity and utility. Areas such as "Deep Learning" and "Natural Language Processing (NLP)" have attracted increased attention worldwide.
I studied the key concepts, theories, techniques, and practices that underpin Machine Learning, and examined the main architectural, usage, and deployment models that characterize it.
Provide you with an in-depth understanding of established machine learning techniques, their real-world applications and the legal contexts in which machine learning operates.
Provide you with comprehensive knowledge of the nature of data and the mechanisms that may be used to pre-process data to support machine learning activities.
Establish a comprehensive and practical awareness of the techniques and metrics used to evaluate machine learning algorithms.
Furnish you with an in-depth and critical knowledge of a range of established approaches to machine learning, including their statistical and mathematical underpinning.
Provide a wide-ranging practical knowledge of an established machine learning workbench.
Topic
Machine Learning systems typically offer businesses several distinct advantages. The benefits include improved agility with knowledge extraction from data, enhanced adaptability and better modeling capabilities. A number of ML algorithm types and deployment patterns are available, and organizations have to make appropriate selections to ensure that they make optimal use of Machine Learning.
Research and reflect on specific ML algorithm types and deployment patterns that have been used by known organizations.
Research online sources, including news articles related to the history of robotics solving world problems.
Reflect on your personal experience of how the design of a robot that solves one of these world issues might look.
Discuss the Machine Learning algorithm types and deployment patterns that have been used by a known commercial organization.
Post
Netflix actively leverages machine learning in the development and utilization of its recommendation system. Specifically, to enhance its search system, Netflix dynamically constructs a real-time adaptive recommendation system by harnessing user interactions within the current session. Through this, it predicts user needs and provides improved recommendations (Bhattacharya & Lamkhede, 2022). Collaborative Filtering and Content-based Filtering algorithms are employed to analyze users' viewing habits and recommend movies and TV programs (Gomez-Uribe & Hunt, 2016).
Netflix actively applies principles related to Continuous Deployment. The company is dedicated to automatically deploying new code changes to its core services after subjecting them to testing and validation. In pursuit of this, Netflix has established a large-scale automation and testing infrastructure, ensuring both stability and efficiency (Robbins et al., 2012).
Machine Learning Algorithms in Recommendation Systems
Collaborative Filtering
Advantages
Data-Driven: It generates personalized recommendations based on diverse user behavioral data.
Scalability: The scalability of collaborative filtering methods is enhanced by efficient matrix factorization, allowing for the management of large-scale user-item interactions (Koren, Bell, & Volinsky, 2009).
Disadvantages
Cold Start Problem: While collaborative filtering can address cold start issues, it still faces difficulties in recommending for new users or items with extremely limited interactions (Koren, Bell, & Volinsky, 2009).
Computational Costs: In cases with a large number of users or items, operations like similarity calculations can become complex.
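User-based collaborative filtering of the kind described above can be sketched with a tiny user-item rating matrix; the ratings below are invented for illustration and are not Netflix data.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, cols = titles); 0 = unrated.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(R, user, item):
    """Predict an unrated cell as a similarity-weighted average of
    other users' ratings for that item (user-based CF)."""
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other != user and R[other, item] > 0:
            sims.append(cosine_sim(R[user], R[other]))
            ratings.append(R[other, item])
    return float(np.dot(sims, ratings) / np.sum(sims))

# User 0 has not rated item 2; the prediction leans towards the most
# similar neighbour's rating.
print(round(predict(R, 0, 2), 2))
```

The similarity computation over every user pair is exactly the operation that becomes costly at scale, which is why production systems turn to matrix factorization (Koren, Bell, & Volinsky, 2009).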
Continuous Deployment
Continuous Deployment is a methodology in software development that automates the release process and rapidly delivers software through frequent updates. While this approach enables efficient development and deployment, it does have some limitations. Below are approaches to overcome these limitations.
Limitation
Balancing Complexity and Stability: The continuous changes in software development and deployment can increase complexity and lead to unforeseen issues, challenging the balance between stability and reliability (Robbins et al., 2012).
Mitigation Approach
To maintain stability and reliability, prepare contingencies for failures and recovery strategies in advance, and detect and address issues early through automated testing and monitoring (Robbins et al., 2012).
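The mitigation described above, detecting issues early and falling back automatically, can be sketched as a simple deployment gate. The function names, metrics and threshold here are invented for illustration, not Netflix's actual tooling.

```python
# Hypothetical continuous-deployment gate: deploy a canary, check its
# error rate, and roll back automatically if the check fails.

def check_error_rate(metrics, threshold=0.01):
    """Return True if the canary's error rate is within tolerance."""
    return metrics["errors"] / metrics["requests"] <= threshold

def deploy_with_canary(metrics, promote, rollback):
    """Promote the new version only if the canary check passes;
    otherwise invoke the prepared recovery strategy."""
    if check_error_rate(metrics):
        promote()
        return "promoted"
    rollback()
    return "rolled_back"

# Simulated canary metrics: 3 errors in 1,000 requests (0.3% error rate).
result = deploy_with_canary(
    {"errors": 3, "requests": 1000},
    promote=lambda: print("promoting new version"),
    rollback=lambda: print("rolling back"),
)
print(result)
```

In a real pipeline the metrics would come from a monitoring system and the promote/rollback callbacks from the deployment infrastructure; the gate's logic is the same.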
Comparison between Netflix and its competitors
Let us compare Netflix and its competitors in terms of their utilization of Collaborative Filtering and Content-based Filtering algorithms, as well as deployment patterns leveraging distributed systems and cloud infrastructure.
Strengths of Netflix
Personalization and Recommendation Accuracy
Netflix possesses the capability to offer personalized recommendations by employing a variety of algorithms. By combining Collaborative Filtering and Content-based Filtering, it provides accurate recommendations considering users' past viewing history, preferences, and item characteristics.
Weaknesses of Netflix
Diversity and Exploration Issues: Netflix's recommendation system tends to rely heavily on users' past viewing history, which can make it difficult to discover new genres or diverse content.
While other competitors may have personalized recommendation systems similar to Netflix's, there can be differences in the following aspects:
Utilization of ML Algorithms
Competitors can employ various ML algorithms beyond Collaborative Filtering and Content-based Filtering. They might develop or enhance recommendation systems using techniques such as Deep Learning or Reinforcement Learning.
Recommendation Accuracy and Diversity
Competitors may attempt innovative approaches to address diversity and exploration issues. For instance, they could develop Hybrid recommendation systems that combine diverse algorithms to provide recommendations.
References
Robbins, J., Krishnan, K., Allspaw, J. and Limoncelli, T. (2012) 'Resilience engineering: Learning to embrace failure', Queue, 10(9). Available at: https://dl.acm.org/doi/10.1145/2367376.2371297
Bhattacharya, M. and Lamkhede, S. (2022) 'Augmenting Netflix search with in-session adapted recommendations', arXiv.org, Available at: https://arxiv.org/abs/2206.02254
Gomez-Uribe, C.A. and Hunt, N. (2016) 'The Netflix Recommender System: Algorithms, business value, and innovation', ACM Transactions on Management Information Systems, 6(4). Available at: https://dl.acm.org/doi/10.1145/2843948
Koren, Y., Bell, R., & Volinsky, C. (2009) 'Matrix factorization techniques for recommender systems', Computer, 42(8), 30-37. Available at: https://ieeexplore.ieee.org/document/5197422
Summary
Machine learning is a branch of computer science and artificial intelligence that uses data and algorithms to imitate how humans learn, gradually improving accuracy. Machine learning uses algorithms to turn various data sets into models to make predictions or identify patterns from new data. Various machine learning algorithms are employed for specific tasks and objectives. For instance, the simple linear regression method applies to prediction tasks such as forecasting stock market trends, while the KNN algorithm is suited for classification problems. Machine learning's varied algorithms require tailored approaches, data quality, and ongoing evaluation for effective utilization, benefiting diverse industries.
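The two algorithms named above, linear regression for prediction and KNN for classification, can be sketched with scikit-learn on tiny synthetic data (the numbers are invented for illustration, not market data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Regression: fit a trend to a synthetic, perfectly linear series.
X = np.arange(10).reshape(-1, 1)     # time steps 0..9
y = 2.0 * X.ravel() + 1.0            # y = 2x + 1
reg = LinearRegression().fit(X, y)
print(reg.predict([[10]]))           # extrapolates the fitted trend

# Classification: KNN assigns a label by majority vote of neighbours.
X_cls = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y_cls = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_cls, y_cls)
print(knn.predict([[5, 5.5]]))       # nearest neighbours are class 1
```

Real forecasting tasks such as stock market prediction are far noisier than this sketch, which is precisely why the summary's points about data quality and ongoing evaluation matter.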
Executive Summary
In my initial post, I discussed the importance of machine learning algorithm types and deployment patterns in the context of Netflix. I highlighted the benefits of machine learning systems, such as improved agility, adaptability, and modeling capabilities. I also noted that organizations such as Netflix must make informed choices when selecting machine learning algorithm types and deployment patterns to maximize their benefits (Databrick, 2023).
During the class discussions, I received valuable feedback from my peers, which provided different perspectives and insights. One peer pointed out the possible problem of sample bias in the data used for training and assessing Netflix's recommendation system. Another peer discussed how Netflix's collaborative filtering and content-based filtering models could be combined to improve its recommendations.
Reflecting on these discussions, I have realized the importance of considering each organization's specific needs and constraints when choosing various machine learning algorithm types and deployment patterns (Barla, 2022). There is no universal solution that fits every situation; hence, it is suitable to choose a customized approach. I have also gained a deeper understanding of different machine-learning approaches' potential challenges and benefits.
In response to the class discussions, I have adjusted my initial perspectives. I now appreciate the complexity of machine learning implementation and the need for a holistic approach that considers data quality, model interpretability, and scalability. I have also learned that organizations should continuously evaluate and iterate on their machine-learning systems to ensure they remain effective and aligned with their goals (WEI, 2023).
Conclusion
My journey through this course has emphasized that machine learning is a versatile tool with the potential to revolutionize various industries. The importance of tailored approaches, data quality, and ongoing evaluation cannot be overstated. Organizations must carefully consider the complex landscape of machine learning to harness its full potential effectively.
A student information system (SIS) is to include additional ML-based insights in its management reporting suite. One of the CSV data files used to prime the SIS Machine Learning application has been found to have significant data quality issues.
You have been asked to provide advice and guidance on the steps that need to be taken to clean the data file as part of quality assurance for the ML system.
Requirements
Perform a detailed analysis of data quality for the dataset provided.
Map out, in outline form, what data cleansing measures would need to be put in place to prepare the dataset for ML work.
Indicate the types of Pandas/Scikit-Learn Python commands that could be used whilst carrying out the measures you suggested in task 2.
Attempt a clean-up of the dataset using the procedures set out in (2) and (3) and report on the degree of success with carrying this out.
This report presents a comprehensive analysis of the provided data and the data cleansing process for the student information system (SIS) dataset. The dataset is intended for machine learning applications, so the cleaned results will feed ML algorithms and must support accurate conclusions about the student information. The main interest is therefore to show how missing data can be cleaned in ways that improve results and preserve the collected data for machine learning (Dong and Peng, 2013). Researchers should always focus on quality assurance, which includes techniques such as expectation-maximization that deliver quality data and information to the intended audience. Inspection of the provided data shows that around 10% of the values are missing, which, if left untreated, would bias the analysis.
The provided data was scrutinized, and several key issues were recorded and assessed. The first issue was missing data, noted in the 'Location' column, and the best solution was to fill the missing locations with the word 'Unknown'. Also under missing data were rows with missing 'Reports To' values; these were removed, as the column was important for the analysis (Jeon et al., 2019). The second issue was data type consistency, where the 'Highest Qualification' and 'Highest Qualification Level' columns exhibited variations in data entry. The best approach was to standardize the entries into the categories 'Masters', 'Doctorate', 'Bachelors', and 'Other'. Duplicate rows sharing the same 'ID' were removed, ensuring that each faculty member is represented once in the dataset. The last issue was value standardization: the 'Join Date' values were converted to datetime format, and 'Title' values were stripped of extra spaces.
A. Handling Missing Data
Fill missing 'Location' values with 'Unknown.'
Remove rows with missing 'Reports To' values.
B. Data Type Consistency
Standardize 'Highest Qualification' and 'Highest Qualification Level' values.
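The measures outlined above can be sketched with Pandas. The column names follow the report's description, but the small frame below is invented stand-in data, not the actual SIS file, and the qualification mapping is illustrative.

```python
import pandas as pd

# Invented stand-in rows mimicking the issues described in the report.
df = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Title": [" Dr ", " Dr ", "Prof", "Ms"],
    "Location": ["London", "London", None, "Leeds"],
    "Reports To": ["A", "A", "B", None],
    "Highest Qualification": ["PhD", "PhD", "MSc", "BSc"],
    "Join Date": ["2015-01-10", "2015-01-10", "2018-03-10", "2020-07-01"],
})

# A. Handling missing data.
df["Location"] = df["Location"].fillna("Unknown")
df = df.dropna(subset=["Reports To"])

# B. Data type consistency: map raw entries onto standard categories.
qual_map = {"PhD": "Doctorate", "MSc": "Masters", "BSc": "Bachelors"}
df["Highest Qualification"] = df["Highest Qualification"].map(qual_map).fillna("Other")

# Remove duplicate faculty records sharing the same ID.
df = df.drop_duplicates(subset=["ID"])

# Value standardisation: parse dates, strip stray spaces from titles.
df["Join Date"] = pd.to_datetime(df["Join Date"])
df["Title"] = df["Title"].str.strip()

print(df)
```

The same `fillna`, `dropna`, `map`, `drop_duplicates`, `to_datetime` and `str.strip` calls apply unchanged to the real file once it is loaded with `pd.read_csv`.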
The data cleaning process was straightforward and successful, producing a dataset well suited to Machine Learning tasks. The missing values were correctly handled, and all the data type inconsistencies were resolved. Additionally, the duplicate records were removed, and the remaining values were standardized to enhance uniformity (Schlomer et al., 2010). The accompanying Python code documents the steps taken to achieve the data cleaning and preparation; the process supports accurate insights and the quality assurance of the SIS Machine Learning system.
References
Dong, Y. and Peng, C.-Y.J. (2013) ‘Principled missing data methods for researchers’, SpringerPlus, 2(1). doi:10.1186/2193-1801-2-222.
Jeon, G. et al. (2019) ‘Special issue on machine learning approaches and challenges of missing data in the era of Big Data’, International Journal of Machine Learning and Cybernetics, 10(10), pp. 2589–2591. doi:10.1007/s13042-019-01010-8.
Schlomer, G.L., Bauman, S. and Card, N.A. (2010) ‘Best practices for missing data management in counseling psychology.’, Journal of Counseling Psychology, 57(1), pp. 1–10. doi:10.1037/a0018082
Organizations often have to decide which of the two broad classes of techniques to use when setting up machine learning systems or attempting to solve problems requiring machine learning solutions.
This discussion invites you to compare and contrast the two classes of machine learning systems, and to explain what criteria to utilize when deciding which specific approach will be the most appropriate when developing a machine learning solution.
Compare and contrast supervised and unsupervised machine learning methods in general, whilst making specific reference to at least two ML modelling techniques in each of the two broad classes. Your comparison should highlight strengths, weaknesses, and challenges for each of the ML modelling techniques mentioned in your post.
Draw up structured criteria that can be used in determining the most appropriate ML approach for developing a machine learning solution to a particular problem.
Post
Machine learning comprises algorithms that extract patterns from data and produce outputs that support conclusive predictions. To draw the best conclusions, machine learning must be applied appropriately when setting up machine learning systems.
Supervised machine learning
The first approach to consider is supervised machine learning, which relies on labelled inputs and benefits from cleaned data, giving better results from the algorithm. The model learns the relationship between the labelled inputs and the output data. The two main examples that illustrate supervised machine learning are regression and classification. In linear regression, a linear relationship is assumed between the input features and the target variable; typical supervised applications include speech recognition and the automated classification of documents (Alloghani et al., 2020). A limitation of linear regression is that it may fail to capture non-linear correlations in the data.
The second example is classification, which identifies which of the learned labels a new input belongs to. Data is structured so that it can be grouped into the class where it should lie. For example, an ML system can be designed to classify books or emails into classes based on the input data. A limitation of classification is that the model might assign data to the wrong class, leading to incorrect inferences.
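The email classification example above can be sketched with a supervised text classifier in scikit-learn; the four training emails and their spam/not-spam labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented labelled emails: 1 = spam, 0 = not spam.
emails = [
    "win a free prize now",
    "claim your free money",
    "meeting agenda for monday",
    "project report attached",
]
labels = [1, 1, 0, 0]

# A supervised model learns the mapping from labelled inputs to outputs.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["free prize money"]))        # words seen only in spam
print(model.predict(["monday project meeting"]))  # words seen only in ham
```

The limitation noted above shows up directly here: an email mixing vocabulary from both classes can easily be placed in the wrong one.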
Unsupervised machine learning
In contrast, unsupervised machine learning applies learning processes to raw, unlabelled data, and it is most often applied to cluster data and detect outliers that might lead to wrong predictions. Two examples of unsupervised machine learning are k-means clustering and principal component analysis (PCA) (Berry et al., 2019). Clustering is important because it groups data with similar characteristics. For instance, marketing campaigns need clear data points, and clustering helps to assign resources through clear procedures. The drawback of k-means is its sensitivity to the initial positioning of centroids and the assumption that clusters are spherical. The second method, PCA, is applied in ML to reduce the dimensionality of data. It is valuable because it increases the interpretability of the data by combining correlated features into a smaller set of components. Its drawback is that it can discard information, which may lead to inconclusive results.
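The two unsupervised techniques mentioned, k-means clustering and PCA, can be sketched on small synthetic data; no labels are given to either algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two well-separated groups of synthetic 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.1, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.1, size=(20, 2)),
])

# K-means groups points with similar characteristics (no labels needed).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))      # points per discovered cluster

# PCA reduces dimensionality while keeping most of the variance.
pca = PCA(n_components=1).fit(X)
print(round(pca.explained_variance_ratio_[0], 3))
```

On this tidy data one component captures almost all the variance; on real data the discarded components can carry information, which is exactly the PCA drawback noted above.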
Structured criteria
The structured criteria apply to both supervised and unsupervised learning techniques. The first criterion is to understand the characteristics of the problem and the data collected, including whether labels are available. The second is the accuracy required: when high accuracy is needed, one has to consider the method that will lead to the best results while limiting data loss and the cost of maintaining the data (Ozgur, 2004). The third is the computational tools available, where the processing power required must be considered; for example, deep learning models trained on labelled data demand substantial computational resources.
References
Alloghani, M., Al-Jumeily, D., Mustafina, J., Hussain, A. and Aljaaf, A.J., 2020. A systematic review on supervised and unsupervised machine learning algorithms for data science. Supervised and unsupervised learning for data science, pp.3-21.
Berry, M.W., Mohamed, A. and Yap, B.W. eds., 2019. Supervised and unsupervised learning for data science. Springer Nature.
Ozgur, A., 2004. Supervised and unsupervised machine learning techniques for text document categorization. Unpublished Master’s Thesis, İstanbul: Boğaziçi University.
Summary
Machine learning is a collection of algorithms that help to extract and predict patterns, supporting simplified decisions in business and computing organizations. Supervised and unsupervised ML are both important, and the choice between them depends on the data presented for analysis. The best example shared in a peer's post concerned Convolutional Neural Networks (CNNs), which are utilized in computer vision and image classification tasks (Heaton, 2017). In my understanding, image classification is applied on websites that need to detect whether a visitor is a robot or a human being, extracting hierarchical characteristics from images in the relevant domain. However, the main drawback is the high computing power required, and at times the unstructured data may need cleaning before it is used.
The second idea that I found worth incorporating into my discussion post is K-Means Clustering, which divides the data into discrete groups. Large datasets can be divided into smaller groups, which helps to detect anomalies in the systems. In my opinion, when one has a large dataset, subdividing it into groups helps to ensure that errors can be recognized before any decision or prediction is made. On the other hand, supervised ML is simple to apply because the data is already classified. For example, in linear regression one examines the input features, and the labelled input data leads to a conclusive report because the prediction is well constrained (Olive, 2017). This shows that when choosing between the two methods, the characteristics of the problem and the complexity of the data have to be analysed. Secondly, the computational tools available have to be considered, with the computing power assessed before any decision is made. When the data is labelled, one can apply supervised learning, while if the data is unlabelled or complex, one is free to apply unsupervised machine learning.
References
Heaton, J. (2017). Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep learning. Genetic Programming and Evolvable Machines, 19(1-2), pp.305–307. doi:https://doi.org/10.1007/s10710-017-9314-z.
You have been engaged as a contract data scientist by Athana Data Science Services (ADSS), a small company specialising in the provision of data science consultancy services to public and private sector organisations. ADSS have just been awarded a contract by a government department (the Department of Environment) to help with the development of machine learning-based models for predicting atmospheric emissions (and pollution) from data gathered by various borough and county environment monitoring units. Your team leader wants you to assist with this project, and you will be required to carry out a number of tasks using the Anaconda/Scikit-Learn Python ML framework and its components.
Datasets to be utilized for developing the ML models
The web page describes the various types of data held for monitored atmospheric emissions for the year 2013. Note that ADSS will ultimately be working on datasets compiled over a 50-year period, but you will be focusing only on a single year (2013), in the first instance, when developing the ML models.
Tasks
Identify and describe in some detail at least 3 machine learning algorithms/techniques that you intend to use in your project. Provide your reasons for selecting those ML methods.
Specify the types of predictive insights you expect to glean from the data after you have applied your ML models. Your response should be based on actual inspection of the datasets and should be as specific as possible.
Develop the respective ML models using your Jupyter notebook and Anaconda/Scikit-Learn toolkit to work on the datasets available on the website.
Assess the performance of each model using suitable ML metrics and explain in detail any differences in model performance.
Particulate Matter (PM), especially PM10, is a major contributor to air pollution with serious health effects (WHO 2016; Samet et al., 2000; Wang and Liu, 2018). The development of precise machine learning (ML) methods for predicting its concentration is imperative for understanding and controlling air pollution and for instituting suitable strategies to alleviate its adverse impacts. This report details our project, which aimed to design and evaluate the performance of such models based on a provided pollution dataset for a single year, 2013. We focused on PM10 for two main reasons. First, its significant impact on premature mortality has made it a critical concern (Wallek et al., 2022). Second, PM10 shows a strong correlation with the explanatory variables (i.e., pollutants) in our dataset.
Given the continuous nature of pollutants' concentration levels, regression-based models are preferable. Therefore, to model PM concentrations, we optimised several ML algorithms, underscoring the complexity of the task and our aim to identify the best model through extensive experimentation.
Linear regression (LR) can provide precise predictions when applied to air pollution data, which is crucial for developing effective air quality management strategies (Eduk et al., 2018; Załuska and Gładyszewska-Fiedoruk, 2020; Joseph et al., 2021; Dhanalakshmi and Radha, 2022).
In this context, the Random Forest (RF) algorithm has gained popularity, attributed to its adeptness in managing complex datasets and its proficiency in handling non-linear data and missing values (Anıl & Ümit, 2022; Agibayeva et al., 2022; Ahmad et al., 2022; Siouti et al., 2022; Rao & Thangaraj, 2023). Furthermore, RF's capability to provide feature importance scores offers valuable insights into factors contributing to air pollution.
Gradient Boosting (GB) is recognised for its ability to build and generalise models by optimising an arbitrary differentiable loss function, offering robustness to outliers, and often surpassing the accuracy of RF, especially with non-linear data (Singh et al., 2021; Ghanim et al., 2023; Hong & Lynn, 2020). Its aptitude for swiftly adapting to data pattern alterations renders it suitable for predicting pollution levels (Wei et al., 2019a; Wei et al., 2019b).
Ridge Regression (RR) demonstrates good performance in handling multicollinearity, a crucial aspect when dealing with environmental data, and its regularisation term aids in averting overfitting, which is especially pivotal when navigating complex environmental datasets (Dhanalakshmi & Radha, 2023; Nowack et al., 2021; Weng et al., 2022). As a linear model, RR brings simplicity and computational efficiency, aligning well with applications where interpretability and computational resources are paramount.
Lasso regression has been utilised in the context of air pollution (e.g., as in Hassan and Bhuiyan, 2021). It inherently performs feature selection by shrinking some of the coefficient estimates to exactly zero, which can be particularly useful when dealing with datasets having numerous predictors, and it handles multicollinearity well (Cheng et al., 2020). Lasso provides sparse solutions and improved predictive accuracy compared to ordinary linear regression, especially when the true underlying model is sparse.
XGBoost often delivers superior predictive accuracy compared to other ML models (Huang et al., 2023; Kriit et al., 2022; Shah et al., 2022; Cabaneros et al., 2022) and can capture non-linear relationships between predictors and the target variable (e.g., pollutant concentration) (Huang et al., 2023). XGBoost can handle missing data internally without requiring imputation (Kriit et al., 2022), is less sensitive to multicollinearity (Shah et al., 2022) and provides feature importance scores.
Neural Networks (NNs) are highly flexible and capable of learning complex patterns and non-linear relationships (Miller et al., 2020). Given the complexity of environmental data, a NN can potentially uncover intricate relationships between different features and air pollution levels (Lin et al., 2022; Pawul 2019; Ramentol et al., 2023). NNs can model complex relationships and discover intricate patterns in the data, making them ideal for predicting PM concentrations based on a multitude of environmental factors.
ML models were developed using Python libraries (i.e., Scikit-Learn and XGBoost) in a Jupyter notebook environment. The dataset was loaded from Excel files, pre-processed, split into training, testing and validation sets, and then fed into the models. Hyperparameter tuning was applied to optimise model performance, particularly for the NNs. The 'Total' column was used as a proxy for PM10 since it encapsulates all the available information. Various quality checks were performed thereafter. Data related to different pollutants was merged based on common identifiers, creating a consolidated dataset, and specific columns were renamed for clarity and consistency.
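The preparation steps described above, merging pollutant files on common identifiers, renaming columns and splitting the data, can be sketched as follows; the column names, values and split ratio are stand-ins, not the actual Department of Environment dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frames for two pollutant files sharing common identifiers.
pm10 = pd.DataFrame({"borough": ["A", "B", "C"], "year": 2013,
                     "Total": [21.0, 34.5, 28.2]})
nox = pd.DataFrame({"borough": ["A", "B", "C"], "year": 2013,
                    "Total": [55.1, 80.3, 61.7]})

# Merge on the common identifiers and rename columns for clarity.
df = pm10.merge(nox, on=["borough", "year"], suffixes=("_pm10", "_nox"))
df = df.rename(columns={"Total_pm10": "PM10", "Total_nox": "NOx"})

# Split into training and test sets; a validation split follows the
# same pattern applied to the training portion.
train, test = train_test_split(df, test_size=0.33, random_state=42)
print(df.columns.tolist())
```

With the real Excel files, `pd.read_excel` replaces the hand-built frames and the same `merge`/`rename`/`train_test_split` calls carry over.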
To explore the relationships between different pollutants, a heatmap was generated (see Figure 1).
Figure 1. Heatmap: Correlations between variables.
The correlation coefficients showed that PM10 has stronger associations with the variables of interest, which is crucial for exploring multifaceted interactions and dependencies. Hence, PM10 emerged as the more suitable choice for our study. We examined PM10 using a histogram and computed summary statistics, and 151 outliers were identified (see Figure 2 and Figure 3).
Figure 2. PM10 before the Box-Cox transformation.
Figure 3. PM10 after the Box-Cox transformation.
Hence, we applied different transformation techniques (i.e., logarithmic, square root and Box-Cox) and used a boxplot to assess their impact. Figure 2 shows PM10 before transformation. As seen in Figure 3, the outliers were effectively addressed. Finally, duplicate columns and columns with high correlations were identified and removed to avoid multicollinearity. Lagged features of PM10 were created to potentially explore temporal relationships in future analyses. Resulting missing values were backward filled.
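The Box-Cox step can be sketched with SciPy on an invented right-skewed sample standing in for raw PM10 concentrations; Box-Cox requires strictly positive values, which concentrations satisfy.

```python
import numpy as np
from scipy import stats

# Invented right-skewed sample standing in for raw PM10 concentrations.
rng = np.random.default_rng(0)
pm10 = rng.lognormal(mean=3.0, sigma=0.5, size=500)

# Box-Cox estimates the power parameter lambda that best normalises
# the data; lambda = 0 corresponds to a log transform.
transformed, lam = stats.boxcox(pm10)

# Skewness should shrink towards zero after the transformation.
print(round(float(stats.skew(pm10)), 2), round(float(stats.skew(transformed)), 2))
```

Because Box-Cox fits lambda to the data, it typically tames skew and outliers more effectively than a fixed log or square-root transform, matching the choice reported above.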
A suite of regression models was initialised and evaluated using 5-fold cross-validation. The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) were computed for each model and tabulated to compare their predictive performances. A grid search was performed on the NNs to optimise the hyperparameters, and each model was evaluated using cross-validation to ensure a robust assessment. Both RMSE and MAE are measures of model error, and lower values indicate better performance.
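The evaluation loop described above, 5-fold cross-validation reporting MAE and RMSE per model, can be sketched as follows; the synthetic regression data merely stands in for the pollution features, so the relative scores here will not match Tables 1 and 2.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the pollution dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# 5-fold cross-validation; sklearn scorers are negated so that higher
# is better, hence the sign flip when reporting errors.
for name, model in models.items():
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}")
```

Extending the dictionary with GB, XGBoost, Lasso and the NN reproduces the full comparison table.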
The results from the initial run are in Table 1;
Table 1. Results a.
and the final results are summarised in Table 2 (in Appendix). We can see some changes and improvements (e.g., in terms of LR). The RF model remains the best model, with an RMSE of 0.2459 and MAE of 0.1801. When comparing this test performance with its validation RMSE of 0.2634, the RF model's performance is consistent and even slightly better on the test set, which is a good sign of generalisation. Comparing across all models, the RF, XGBoost, and GB models are the top three performers in terms of RMSE. Their performance is closely matched, with RF leading by a small margin in the validation results.
Table 2. Results b.
The NN, though not matching the tree-based models, still outperforms the linear models (LR, Ridge, and Lasso) in terms of RMSE. Despite its simplicity and interpretability, LR yielded higher RMSE and MAE compared to the ensemble models, reflecting its limitations in handling the intricate, non-linear associations in the dataset. The Ridge and Lasso models did not significantly outperform LR, suggesting that the added regularisation might not be particularly beneficial for this dataset, or that the hyperparameters need further tuning. Finally, the NN demonstrates a viable option given its flexibility and capability to model non-linearities; it outperforms the linear models, indicating its ability to capture more complex relationships in the data.
Table 3. Results for feature selection based on RF.
Understanding the impact of features within a model is crucial for interpreting the model's decisions and deriving actionable insights. These values (computed for the RF model) provide insights into the relative influence of each feature on the model's predictive decisions: Table 3 shows that NMVOC (i.e., Non-Methane Volatile Organic Compounds) is the most influential feature, with an importance of approximately 0.4257. These results provide a foundational understanding of the variables that are instrumental in predicting PM10, thereby guiding future research, model refinement, and policymaking by highlighting areas that warrant further investigation and intervention.
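Feature-importance scores of the kind reported in Table 3 can be extracted from a fitted Random Forest as follows. The data and the pollutant names here are synthetic placeholders; the NMVOC figure quoted above comes from the actual dataset, not this sketch.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic features standing in for the pollutant predictors.
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
features = ["NOx", "NMVOC", "SO2", "CO"]  # illustrative names only

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1, ranking
# each feature's contribution to the forest's splits.
importances = pd.Series(rf.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

Because impurity-based importances can favour high-cardinality features, permutation importance is a common cross-check before drawing policy conclusions from such rankings.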
Using these models offers a deeper understanding of how pollution behaves. We expect to learn at least the following:
Tree-based models, like RF and XGBoost, will show which features, such as pollution variables or traffic data, are most important in predicting air quality. This can help focus our strategies. These models might also pinpoint times or seasons when pollution peaks. Knowing this can help us craft timely warnings or policies.
With location data in our set, we could spot areas that consistently suffer from high pollution. Such insights will be crucial for targeted clean-up efforts.
Advanced models, like XGBoost and NNs, could unveil complex relationships between different factors and pollution. They might show us when and why air quality dips to unsafe levels, helping us take preventive measures. By studying the gaps in our models, we could see where they fall short or find unexpected trends in our data. This can help refine our models.
Tweaking what we feed into our models and watching their output will show which factors most sway pollution levels.
Lastly, different models might be better at predicting some pollutants than others. Understanding this can guide us on where to put our focus or show where we need more data. Also, testing our models in various settings will tell us how reliable they are across different situations or places.
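Two of the analyses listed above, tweaking model inputs to see which factors sway predictions and studying the gaps (residuals) in model output, can be sketched together. This is an illustrative example on synthetic data, using permutation importance as one standard way to probe input influence:

```python
# Permutation importance ("tweaking what we feed into our models") and
# residual inspection ("studying the gaps") for a gradient-boosting model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=6, noise=0.2, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)

gb = GradientBoostingRegressor(random_state=3).fit(X_tr, y_tr)

# How much the test-set score drops when each feature is shuffled:
perm = permutation_importance(gb, X_te, y_te, n_repeats=10, random_state=3)

# Systematic patterns in residuals indicate where the model falls short:
residuals = y_te - gb.predict(X_te)
print("Most influential feature index:", int(np.argmax(perm.importances_mean)))
print("Mean residual:", round(float(residuals.mean()), 3))
```

A mean residual far from zero, or residuals that trend with time or location, would flag exactly the kind of blind spot the text proposes to look for.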
To find the best-performing model for predicting atmospheric emissions, we utilised various ML models, each revealing distinct insights and applicability in environmental data modelling. Ensemble models, especially RF, GB, and XGBoost, demonstrated a notable ability to navigate the data's complexity and non-linearity. While the NN did not surpass the tree-based models, it showcased the potential of deep learning (DL) in capturing complex data relationships, indicating a promising avenue for further optimisation and exploration. Conversely, the Linear, Ridge, and Lasso models, despite their higher error metrics, emphasised the challenges linear models face with complex datasets, highlighting the critical balance needed between model complexity, interpretability, and predictive accuracy.
Acknowledging the limitations of our study, constrained by the available data and features and the potential geographical and temporal specificity of the models, future work may explore alternative DL architectures, additional data sources, and model interpretability tools to enhance predictive performance and usability in practical applications. This ensures the development of models that are not only predictive but also comprehensible and trustworthy in navigating environmental data complexities.