Journal Search Engine

ISSN : 2005-0461(Print)
ISSN : 2287-7975(Online)

Journal of Society of Korea Industrial and Systems Engineering Vol.46 No.3 pp.186-197
DOI : https://doi.org/10.11627/jksie.2023.46.3.186

Development of a Framework for Improvement of Sensor Data Quality from Weather Buoys

Ju-Yong Lee^*

, Jae-Young Lee^**

, Jiwoo Lee^**

, Sangmun Shin^***

, Jun-hyuk Jang^****

, Jun-Hee Han^**†

^*Division of Business Administration & Accounting, Kangwon National University
^**Department of Industrial Engineering, Pusan National University
^***Department of Industrial & Management Engineering, Dong-A University
^****Maritime Digital Transformation Research Center, Korea Research Institute of Ships & Ocean Engineering

^†Corresponding Author : junhan@pusan.ac.kr

Received 15/09/2023 Finally Revised 17/09/2023 Accepted 18/09/2023

Abstract

In this study, we focus on the improvement of data quality transmitted from a weather buoy that guides a route of ships. The buoy has an Internet-of-Thing (IoT) including sensors to collect meteorological data and the buoy’s status, and it also has a wireless communication device to send them to the central database in a ground control center and ships nearby. The time interval of data collected by the sensor is irregular, and fault data is often detected. Therefore, this study provides a framework to improve data quality using machine learning models. The normal data pattern is trained by machine learning models, and the trained models detect the fault data from the collected data set of the sensor and adjust them. For determining fault data, interquartile range (IQR) removes the value outside the outlier, and an NGBoost algorithm removes the data above the upper bound and below the lower bound. The removed data is interpolated using NGBoost or long-short term memory (LSTM) algorithm. The performance of the suggested process is evaluated by actual weather buoy data from Korea to improve the quality of ‘AIR_TEMPERATURE’ data by using other data from the same buoy. The performance of our proposed framework has been validated through computational experiments based on real-world data, confirming its suitability for practical applications in real- world scenarios.

Key Words : Weather Buoy , Data Quality Management , Machine Learning , Data Fault Detection , Data Interpolation

해양기상부표의 센서 데이터 품질 향상을 위한 프레임워크 개발

이 주용^*, 이 재영^**, 이 지우^**, 신 상문^***, 장 준혁^****, 한 준희^**†

^*강원대학교 경영회계학부
^**부산대학교 산업공학과
^***동아대학교 산업경영공학과
^****선박해양플랜트연구소 해사디지털서비스연구센터

초록

키워드 :

This article has been cited by 0 article in crossref

Cited-By

Funding:

1. Introduction

Weather buoys floating near ports play as a warning sign of reefs or dangerous zones for ships approaching ports. Also, the buoys emit light to guide ships into or out of ports in bad weather. Usually, weather condition near the coast changes suddenly and it is frequently foggy, rainy, or windy. Thus, the role of buoys is very important. Particularly, South Korea, which is dealt with as a case study in this paper, is a peninsular surrounded by sea on three sides. Hence, there are many ports along the coast and lots of buoys are installed near the ports. In <Figure 1>, red points indicate the locations of buoys installed on the sea of Korea.

Recently, weather buoys have been equipped with the Internet of Things (IoT) sensors and wireless communication devices. Sensors measure meteorological data (e.g., air and water temperature, air pressure, humidity, and wind speed) and the buoy’s status (e.g., battery voltage and electricity, and solar charging). Then, wireless communication devices transmit the data to the central database in a ground control center and the ships nearby at irregular intervals. <Figure 2> is an example of a buoy with sensors and a wireless communication device. Ships use the data provided by the buoys to plan their routes and schedules. The accuracy of the data is therefore essential to fulfill the purpose of the buoy as a navigational aid.

The sensors considered in this study observe air and water temperature, air pressure, humidity, wind speed, and data collection time and send them by wireless. Usually, meteorological data shows complex patterns with daily variation and seasonality. In particular, the weather at sea changes more suddenly than the weather on land. In addition, weather buoys are not stationary but float on the sea, and they are always exposed out-of-door. Besides, the buoys are very far from the ground control center. Owing to these reasons, transmission failure may often, or abnormal data is transmitted frequently. According to Li et al. [16], these erroneous data from sensors can be divided into two types: incipient failure and abrupt failure. The former causes null data, the latter makes abnormal values (outliers). These errors make it difficult for users to properly utilize the collected data. Therefore, it is important to detect and correct errors such as nulls, and outliers.

This study focuses on the quality of meteorological data transmitted from weather buoys. Ships receive the meteorological data from nearby buoys, and they make routes and schedules and check safety. However, as mentioned, the sensors of the buoys occasionally experience incipient and abrupt failures. When ships receive data including errors such as null and outliers or do not receive any data, they request normal data from a ground control center. Therefore, errors in meteorological data in the ground control center must be detected and interpolated in real time to provide reliable data whenever ships request it. To achieve high reliability in the buoy’s data, data cleansing is necessary not to contain errors.

To improve the quality of meteorological data from weather buoys, this study develops a framework of methods to methods to detect and interpolate erroneous data to provide the ships at sea with reliable data, and it should be processed immediately right after receiving the data. The framework is based on machine learning models for data preprocessing, error detection, and interpolation. After training the machine learning models with historical meteorological data, the framework can detect and interpolate the errors instantly as soon as receiving the data. Also, as Barnett and Lewis [2] defined, there may need to use several kinds of algorithms to remove fault data and interpolate as new forecasting data. Therefore, we develop a new framework to improve the quality of air temperature data above the sea to integrate appropriate machine learning algorithms.

This paper is organized as follows. Section 2 introduces related studies and specifies its limitations. Section 3 introduces the methods sequentially performed in the proposed framework, and Section 4 validates the performance of the suggested framework using data collected from buoys on the sea of Korea. Finally, Section 5 presents the conclusions and further research of this study.

2. Related Studies and limitations

Buoys have been used for a few decades, so there are many studies on the uses of buoy data for various objectives. Among them, Reynolds [19] suggested a platform for monitoring the sea surface temperature using satellites. Buoy data can be used as a comparison group to check the quality of data from satellites [8] or several types of forecasting systems [3]. From, they use buoy data itself. Heidemann et al. [11] introduced research challenges related to underwater sensor networking, while [15] focused on the collective motion of the ocean sensor network. Among these researches, Venkatesan et al. [20] focused on drift characteristics of temperature sensors of buoys, and this means that if the data is not calibrated to normal data, the quality of the data may continue to deteriorate.

Among many related studies, our suggested problem is more related to data fault detection. Barnett and Lewis [2] defined outliers as observations (or subsets of observations) that appear to be inconsistent with the rest of the data in a given dataset. Qin [18], Ge et al. [10], and Dai and Gao [6] introduced several related studies related to fault detection and diagnosis that have been considered in the areas of monitoring industrial processes such as safety, quality, and operation efficiency, etc. For other applications, Yin et al. [23], Zhang et al. [25], and Xiang et al. [22] developed a convolution neural network, XGboost, and LSTM for fault detection of wind turbine data, respectively.

Our suggested problem also considers the replacement of fault data as a normal value, so the forecasting algorithm is also needed. There have been a lot of related studies for data forecasting, so we focused on temperature forecasting research. Abdel-Aal [1] proposed alternative abductive networks approach to forecast next-day and next-hour temperatures. Zamora-Martinez et al. [24] proposed an online learning model for forecasting indoor temperatures. Cifuentes et al. [5] reviewed related studies related to air and water temperature forecasting using machine learning techniques from short-term like daily and hourly to long-term global temperature forecasting global warming during decades.

Although there are a lot of related studies related to data cleansing, data fault detection, and forecasting, it is important to integrate methodologies to find fault data and interpolate them. This study can contribute to the literature by developing a framework that can operate both fault detection and interpolation simultaneously with the different machine learning models. It can also contribute to identifying errors and interpolating them to appropriate values in real time as data is collected.

3. Models for Data Quality Improvement

This section describes the considered problem in more detail and proposes a framework to improve data quality. This study considers meteorological data transmitted from sensors of weather buoys. Every weather buoy in Korea collects air temperature, air pressure, humidity as a default, and wind speed. Among the various data, this study focuses on air temperature data because it is the most effective factor for sensing the changes in weather in the sea. Note that air temperature is the most important indicator of changes in atmospheric circulation and is a factor that can represent the meteorological conditions of the sea.

Generally, temperature data recorded successively on land shows trends and has small variations. However, because weather buoys are floating and communicate wirelessly with a distant ground control center or ships, the variation is larger than that in land and the intervals of time stamps in the database are not regular. Additionally, the data includes erroneous values such as outliers and nulls, where an outlier is an observation that lies far away from the central of a given data set, and null data represents the intentional absence of any object value. Thus, this study aims to develop a framework to improve the quality of the air temperature data from weather buoys. <Figure 3> shows a flow chart of the proposed framework with several models to detect erroneous values and interpolate them.

First, ‘Data Collection’ step collects and stores the data transmitted from the weather buoys. Note that this study assumes the data is already given and hence the procedure of the framework starts at the second step ‘Data Check’ to make the irregular intervals between two successive time stamps regular. By scanning the data and counting the frequencies of time intervals, it finds the most frequent interval among the ones between two successive time stamps, which is called normal interval. Then, if there are time stamps within the normal interval, they are removed. Otherwise, if an interval between two timestamps is longer than normal, null data is inserted between the two timestamps to make up for the time gap. After ‘Data Check’, ‘Data Preparation’ for machine learning is performed. In this step, additional input values such as mean, standard deviation, and min/max of each column are generated along with existing column values. The next step ‘Fault Detection’ detects outliers, excepting nulls. Two methods are used to detect outliers: the interquartile range method and NGBoost. Using these methods, the detected outliers are substituted by null at step ‘Fault Data Remove’. Finally, ‘Data Interpolation’ replaces the nulls with proper values using machine learning models: CatBoost, XGBoost, Random Forest, NGBoost, and long-short term memory (LSTM). These models are used due to the type of error. If only the temperature data (target data) is removed or missing, the CatBoost, XGBoost, Random Forest, and NGBoost algorithms can interpolate the object value by using other data sets such as humidity, and air pressure as input data. Note that, unlike the 'Fault Detection' step in the flowchart, we didn’t specify an algorithm in the ‘Data Interpolation’ step because only one of the proposed algorithms is used in the framework. The subsequent subsections give a detailed description of the suggested models: interquartile range and NGBoost-I for detecting outliers, and LSTM, CatBoost, XGBoost, Random Forest, and NGBoost-II for interpolation.

3.1 Detecting outliers

We use two outlier removal algorithms: interquartile range (IQR) and NGBoost-I. These algorithms run sequentially.

3.1.1 Interquartile range (IQR)

The interquartile range (IQR) is a robust statistical method used to measure the spread or dispersion of a dataset by focusing on the middle 50% of the dataset. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1) of the dataset. Quartiles are values that divide a dataset into four equal parts, with Q1 representing the 25th percentile and Q3 representing the 75th percentile. In this model, Q1-1.5×IQR is used as a lower bound while Q3+1.5×IQR is used as an upper bound. By discarding the extreme values in the dataset, the interquartile range provides a more reliable measure of variability, particularly in the presence of outliers or skewed data. In our observed sensor data, there are instances where values are not null, but due to sensor errors, extreme values are recorded. Additionally, there are cases where the sensor fails to provide data, leading to periodic output of pre-defined default values. In such scenarios, the Interquartile Range (IQR) proves to be an effective method for filtering out anomalies. In this study, we leverage the IQR algorithm based on the Q1 and Q3 values of the data from the past 7 days, and values under the lower bound and above the upper bounds are detected as outliers.

3.1.2 NGBoost-I

The Natural Gradient Boosting (NGBoost) algorithm is a machine learning algorithm that combines the principles of gradient boosting with a probabilistic framework suggested by Duan et al. [7]. It's designed to perform probabilistic regression and classification tasks while also providing predictive uncertainty estimates for its predictions. Traditional gradient boosting algorithms, such as the popular XGBoost and LightGBM, focus on minimizing the mean squared error or other point-wise loss functions to improve predictive accuracy. NGBoost, on the other hand, extends this concept by optimizing a probabilistic loss function. This means that NGBoost doesn't just provide point predictions, but also models the entire predictive distribution.

NGBoost algorithm consists of three steps, each of which is ‘base learners’, ‘probability distribution’, and ‘scoring rule’. The base learner is a step for making a decision tree, while the probability distribution step makes a distribution by using normal distribution or Bernoulli distribution. Note that air temperature has continuous values, so Normal distribution is used. Then, the scoring rule evaluates the performance of the prediction.

NGboost can give the distribution of the prediction values, and it can be used well to find outlier data. NGboost algorithm gives not only a prediction value but also its upper and lower bounds. If the value is under the lower bound or over the upper bound, it can be detected as an outlier value. The gap between the upper and lower bound can be manipulated by controlling the confidence interval. NGBoost-I, which we propose, utilizes 19 types of input data presented in <Table 1>.

3.2 Data Interpolation

After detecting a single outlier, the interpolation process should be operated to interpolate the null values. The interpolation requires the latest 30 data, but if outlier values are used as they are or null values exist, the accuracy of the NGBoost model inevitably decreases. Therefore, it is necessary to immediately substitute an interpolated value after an error is detected. The temperature data we need to interpolate exhibits a recurring trend over the course of a day, making accurate interpolation challenging with basic techniques such as moving averages or exponential smoothing. We use another NGBoost algorithm (NGBoost-Ⅱ), Long Short-Term Memory (LSTM), XGBoost, and CatBoost for interpolation and compare the performance of these algorithms.

<Table 2> shows the input values of each model for interpolation. Note that in the case of NGBoost-Ⅱ, we reduce input parameters from NGBoost-I. The reason is that the purpose of NGBoost-I is to detect outliers, and NGBoost-Ⅱ is focused on finding appropriate interpolate values. Due to our pre-test, the performance of boosting algorithms is increased when we remove 12 input columns (average, standard deviation, minimum, and maximum) of HUMIDITY, WIND_SPEED, and AIR_PRESSURE). If we do not consider all input values of a sensor in NGBoost- I, it may detect the normal value as an ‘outlier’. Therefore, NGBoost-I can give robust values (wide width of prediction interval), and NGBoost-Ⅱ gives the proper values with narrow width of the prediction level.

Also, since Hochreiter and Schmidhuber [12] introduced long short-term memory (LSTM), there have been many studies for temperature data that LSTM-based algorithms give a good forecasting performance compared to existing forecasting models [9,13,14,17]. Therefore, we also use LSTM for forecasting null values that were detected and removed in the related steps. LSTM only uses 30 related AIR_TEMPERATURE values to predict the next data.

We employ Random Forest, CatBoost, and XGBoost methods that fall under the ensemble learning methodology. Their fundamental concept involves amalgamating multiple basic models known as weak learners to create a robust model capable of achieving enhanced predictive accuracy. These approaches have emerged as a promising alternative in the field of building energy efficiency and have garnered support through numerous studies, such as those demonstrating their efficacy in forecasting energy consumption [21] and predictive energy models [4].

4. Computational Experiments

This section reports the evaluation of the suggested models of IQR and NGBoost-I for detecting outliers and NGBoost-II, XGBoost, CatBoost, Random Forest, and LSTM for interpolation to improve data quality transmitted from a weather buoy. For the evaluation, we used the ‘AIR_TEMPERATURE’ data of August 30, 2022. The data is transmitted every minute, so there are 1440 values in a day. Among them, 30% were randomly selected and replaced with three types of outliers (0, large, and small), each with 10%. It then explores the outlier data representing the rate of increase/decrease using a total of 4 values (3%, 5%, 7%, 10%). Therefore, there are 4 scenarios in total. <Figure 4> gives examples of generated data based on the real-world case of a marine buoy. In <Figure 4>, The x-axis of the graph represents time, and the y-axis represents temperature. Red marks mean the generated outliers and blue dots are the original/normal data of ‘AIR_TEMPERATURE’.

As we mentioned in the related section, the NGBoost algorithm provides different upper and lower bounds depending on the width of the prediction interval. Therefore, we test the accuracy of outlier detection by changing the probability of the prediction intervals to 90%, 95%, 99%, and 99.5%. Also, since the interpolated values are used for the next outlier detection, outlier detection results may be affected by the type of model used for interpolation. Hence, it is essential to evaluate and compare the effectiveness of each combination of prediction interval (PI) and interpolation algorithms for both fault detection and data interpolation, rather than conducting separate validations for each. We tested 4 levels of PI (90%, 95%, 99%, and 99.5%) and 5 kinds of interpolation algorithms (CatBoost, NGBoost, XGBoost, Random Forest, and LSTM) within 4 scenarios. Therefore, there are 80 combinations of results for finding the most effective methodology.

In our framework, it is important to detect outliers along with an interpolation step. If an outlier is not detected and removed properly, it is mistaken for a normal value and predicts the next value, resulting in a very large interpolation error. Therefore, the width of the prediction interval of NGBoost-I, which determines the outlier, is important, and the prediction rate according to the width of the prediction interval should be compared with the results. Therefore, we compare accuracy, precision, and recall to see if they are suitable. The results of the four scenarios for each prediction interval are summarized in <Table 3> ~ <Table 6>.

<Table 3> ~ <Table 6> show the accuracy, precision, recall, and F1-score of all combinations. Accuracy is the ratio of the number of correct predictions to the total number of fault data that we generated. If the accuracy is 1.00, all outlier data are detected. True Positive (TP) is the number of true outlier values determined by the model to be outliers. Conversely, TN is the number of actual non-outliers that the model judges are not outliers. False Positive (FP) is when a model wrongly predicts a positive outcome. False Negative (FN) is when a model wrongly predicts a negative outcome. These terms are crucial for evaluating classification model performance. Using these columns, we can calculate precision and recall value. Precision is the accuracy of positive predictions by measuring the proportion of correct positive predictions out of all predicted positives, and recall evaluates a model's ability to identify all actual positives by calculating the proportion of correct positive predictions out of all actual positives.

As shown in <Table 3>, the accuracy and precision of 90% PI are better than other PI values in scenario 1. Scenario 1 generated outliers with 3% increase or decrease, which means normal and abnormal (outlier) data is the closest among 4 scenarios. By applying Prediction Intervals (PI) of 90%, we observe that the NGBoost-Ⅰ algorithm demonstrates the narrowest gap between the upper and lower bounds of normal data points. Consequently, this approach proves effective in filtering out anomalies that are closely situated to normal data points. However, we observe that the False Negative (FN) values are the highest at 90%. This observation suggests that due to narrow prediction intervals, instances of misclassifying normal data points as anomalies are prevalent. Across all algorithms, a consistent pattern emerges, with the combination of NGBoost with a 90% PI and XGBoost yielding the most promising results. On the other hand, at PIs of 99% and 99.5%, most normal data points are correctly identified, although many anomalies go undetected.

Scenario 2 (5%) in <Table 4> exhibits higher accuracy and precision compared to scenario 1. Unlike scenario 1, scenario 2 performs best at a 95% PI. Among the algorithms in this scenario, LSTM and Random Forest achieve the highest accuracy of 0.97, with LSTM displaying the highest precision. <Table 5> and <Table 6> present the results of scenarios 3 and 4 respectively. These scenarios demonstrate improved algorithm performance due to clearer differentiation between anomalies and normal data points compared to scenarios 1 and 2. Scenario 3 shows that most algorithms, except NGBoost-II, perform well at PIs of 90% and 95%. However, as the PI increases to 99% and 99.5%, the number of False Positives (FP) substantially rises, adversely affecting model precision. In <Table 6>, scenario 4, characterized by the largest distinction between anomalies and normal data points, displays a remarkable ability of PIs at 99% and 99.5% to almost entirely filter out anomalies, except for XGBoost which has non-zero FP at these PIs.

<Table 7> shows the mean absolute percentage error (MAPE) of all combinations of 4 scenarios. Note that in <Table 7>, SC1 (3%) refers to a scenario in which outliers are generated by increasing or decreasing the original data by 3%.

LSTM shows the best performance in 15 cases out of 16. Random Forest gives the best result in SC2 with 90% PI. When comparing average values, LSTM gave the best result with 0.51%. Random Forest gives 0.62%. The results of 3 kinds of boosting algorithms followed random forest. The results mean that LSTM is a suitable methodology for predicting temperature data, and related temperature values are enough to predict the next one. In particular, the temperature data of a weather buoy used in this study are often similar to related data due to the short data collection interval. Therefore, LSTM, which reflects the characteristics of time series data well, can be considered more effective.

Based on the accuracy and MAPE results, it is deduced that when the percentage of anomalies exceeds 7%, PIs of 99% or 99.5% should be employed. For scenarios within this threshold, a PI of 95% proves optimal in revealing errors. However, considering the irregular distribution of anomaly gaps in real-world data and the possibility of data points closely resembling true values, a 95% PI is deemed appropriate. Although both Random Forest and LSTM exhibit similar capabilities in accurately filtering anomalies, MAPE considerations suggest that LSTM is better suited for interpolation. Hence, the proposed optimal approach involves utilizing the NGBoost-I algorithm with a 95% PI for anomaly detection, followed by LSTM for the removal of detected anomalies and subsequent interpolation. This approach is considered superior.

We tested the NGBoost-I with 95% PI and LSTM on a real dataset on August 31, 2022. In <Figure 5>, the purple width represents the 95% prediction interval by NGBoost-I. The blue dots in the red circles outside the prediction interval are judged to be outliers, which are replaced by the interpolated values in the orange dots. When applied to the real-world model, our model produced relatively reasonable results and the MAPE was 0.17%.

5. Conclusion

In this study, we developed a framework to improve the quality of data transmitted from a weather buoy. The buoy sends several types of data by using IoT sensors, and there are a lot of null or fault data. To improve the quality of data, we detect the blank and fault data in processes of data check and fault data detection by using lower bounds and upper bounds obtained from the NGBoost-I algorithm. The detected data is changed to null data, and the null data is interpolated by 5 machine learning algorithms. The performance of the proposed framework was verified through the 4 test scenarios based on the actual buoy in Korea, and the combination of 95% of PI of NGBoost-I and LSTM gave the best results.

Through this study, we confirmed that the suggested framework can be used for several types of data in buoys by using historical data of sensors. If we consider the multiple numbers of data sets from sensors instead of a single buoy, the performance of improvement processes is expected to improve. However, this study only considered a period of data within 2 months. If the period is increased, there might be several additional considerations such as seasonality. Also, the data comes from IoT sensors, so a case of data missing over a long-term period may occur. In this case, it may be necessary to improve the framework to reflect various situations in real- world cases.

To increase the results of the suggested framework, future research could consider several cases that can occur in data from sensors. For instance, If all columns of data gathered at the same time are outliers, the suggested framework may give inappropriate results for interpolation. In addition, if the value of the errors gradually increases (drift), the suggested framework may recognize continuous defect data as normal and fail to correct it. Therefore, we may add a feedback process to ensure stability.

Acknowledgment

This research was supported by Korea Institute of Marine Science & Technology Promotion (KIMST) funded by the Ministry of Oceans and Fisheries (20210650). This work was also supported by “Human Resources Program in Energy Technology” of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea. (No. 202106540003)

Figure

<Figure 1>.

Locations of Weather Buoys in Korea

<Figure 2>.

An Example of a Weather Buoy with IoT Sensors at Sea

<Figure 3>.

Flow Chart of the Proposed Framework

<Figure 4>.

Examples of Generating Outliers with Increasing/Decreasing the Original Values by 10%

<Figure 5>.

Result of NGBoost-I with 95% PI and LSTM on Real Data

Table

<Table 1>.

Input data of NGBoost-I

Input data	Description
WIND_SPEED	‘WIND_SPEED’ value at the sametime
HUMIDITY	‘HUMIDITY’ value at the sametime
AIR_PRESSURE	‘AIR_PRESSURE’ value at the sametime
WIND_SPEED_mean	Arithmetic mean of the most recent 30 minutes’ data of WIND_SPEED
WIND_SPEED_std	Standard deviation of the most recent 30 minutes’ data of WIND_SPEED
WIND_SPEED_max	Maximum of the most recent 30 minutes’ data of WIND_SPEED
WIND_SPEED_min	Minimum of the most recent 30 minutes’ data of WIND_SPEED
AIR_TEMPERATURE_mean	Arithmetic mean of the most recent 30 minutes’ data ofAIR_TEMPERATURE
AIR_TEMPERATURE _std	Standard deviation of the most recent 30 minutes’ data ofAIR_TEMPERATURE
AIR_TEMPERATURE _max	Maximum of the most recent 30 minutes’ data ofAIR_TEMPERATURE
AIR_TEMPERATURE _min	Minimum of the most recent 30 minutes’ data ofAIR_TEMPERATURE
HUMIDITY_mean	Arithmetic mean of the most recent 30 minutes’ data of HUMIDITY
HUMIDITY_std	Standard deviation of the most recent 30 minutes’ data of HUMIDITY
HUMIDITY_max	Maximum of the most recent 30 minutes’ data of HUMIDITY
HUMIDITY_min	Minimum of the most recent 30 minutes’ data of HUMIDITY
AIR_PRESSURE_mean	Arithmetic mean of the most recent 30 minutes’ data of AIR_PRESSURE
AIR_PRESSURE_std	Standard deviation of the most recent 30 minutes’ data of AIR_PRESSURE
AIR_PRESSURE_max	Maximum of the most recent 30 minutes’ data ofAIR_PRESSURE
AIR_PRESSURE_min	Minimum of the most recent 30 minutes’ data ofAIR_PRESSURE

<Table 2>.

Input Data of Machine Learning Models for Interpolation

Algorithm(s)	Inputdata
NG Boost, XG Boost, CatBoost, Random Forest	WIND_SPEED
HUMIDITY
AIR_PRESSURE
AIR_TEMPERATURE_mean
AIR_TEMPERATURE_std
AIR_TEMPERATURE_max
AIR_TEMPERATURE_min
AIR_TEMPERATURE_min
Long Short Term Memory (LSTM)	AIR_TEMPERATURE_mean
AIR_TEMPERATURE_std
AIR_TEMPERATURE_max
AIR_TEMPERATURE_min

<Table 3>.

Results of Detecting Outliers of Scenario 1 (3%)

Algorithms	PI(%)	Accuracy	Precision	Recall	F1Score	TN	TP	FN	FP
Cat Boost	90	0.87	0.65	0.88	0.74	968	280	152	40
95	0.86	0.54	0.99	0.70	1005	233	199	3
99	0.80	0.34	1.00	0.51	1008	148	284	0
99.5	0.81	0.37	1.00	0.54	1008	159	273	0
LSTM	90	0.87	0.67	0.87	0.76	965	289	143	43
95	0.86	0.54	0.97	0.69	1002	232	200	6
99	0.80	0.35	1.00	0.52	1008	150	282	0
99.5	0.81	0.36	1.00	0.53	1008	157	275	0
NGBoost-Ⅱ	90	0.87	0.75	0.82	0.78	936	322	110	72
95	0.86	0.55	0.98	0.71	1004	239	193	4
99	0.80	0.35	1.00	0.52	1008	150	282	0
99.5	0.81	0.36	1.00	0.53	1008	157	275	0
Random Forest	90	0.83	0.73	0.72	0.72	885	314	118	123
95	0.86	0.54	1.00	0.70	1007	235	197	1
99	0.80	0.34	1.00	0.51	1008	147	285	0
99.5	0.81	0.36	1.00	0.53	1008	156	276	0
XGBoost	90	0.88	0.63	0.94	0.76	989	274	158	19
95	0.86	0.53	1.00	0.70	1007	231	201	1
99	0.80	0.34	1.00	0.51	1008	146	286	0
99.5	0.81	0.36	1.00	0.53	1008	154	278	0

<Table 4>.

Results of Detecting Outliers of Scenario 2 (5%)

Algorithms	PI(%)	Accuracy	Precision	Recall	F1Score	TN	TP	FN	FP
Cat Boost	90	0.85	0.85	0.71	0.77	855	369	63	153
95	0.92	0.74	0.98	0.85	1003	321	111	5
99	0.82	0.40	1.00	0.57	1008	173	259	0
99.5	0.83	0.45	1.00	0.62	1008	193	239	0
LSTM	90	0.92	0.96	0.80	0.87	906	414	18	102
95	0.97	0.95	0.94	0.94	983	409	23	25
99	0.83	0.44	1.00	0.61	1008	188	244	0
99.5	0.88	0.59	1.00	0.74	1008	253	179	0
NG Boost-Ⅱ	90	0.76	0.91	0.57	0.70	706	394	38	302
95	0.96	0.93	0.94	0.93	981	400	32	27
99	0.83	0.42	1.00	0.59	1008	180	252	0
99.5	0.88	0.59	1.00	0.74	1008	253	179	0
Random Forest	90	0.97	0.98	0.94	0.96	979	424	8	29
95	0.97	0.90	0.99	0.94	1004	388	44	4
99	0.82	0.40	1.00	0.57	1008	173	259	0
99.5	0.85	0.50	1.00	0.66	1008	214	218	0
XG Boost	90	0.95	0.88	0.96	0.92	991	379	53	17
95	0.91	0.69	1.00	0.82	1007	300	132	1
99	0.81	0.38	1.00	0.55	1008	165	267	0
99.5	0.84	0.45	1.00	0.62	1008	195	237	0

<Table 5>.

Results of Detecting Outliers of Scenario 3 (7%)

Algorithms	PI(%)	Accuracy	Precision	Recall	F1Score	TN	TP	FN	FP
Cat Boost	90	0.94	0.99	0.83	0.90	921	427	5	87
95	0.98	0.94	0.99	0.96	1003	405	27	5
99	0.90	0.68	1.00	0.81	1008	295	137	0
99.5	0.90	0.66	1.00	0.80	1008	285	147	0
LSTM	90	0.96	1.00	0.89	0.94	954	431	1	54
95	0.98	0.99	0.94	0.97	983	427	5	25
99	0.95	0.82	1.00	0.90	1008	354	78	0
99.5	0.96	0.86	1.00	0.93	1008	372	60	0
NG Boost-Ⅱ	90	0.77	0.94	0.57	0.71	705	405	27	303
95	0.96	0.99	0.88	0.93	952	427	5	56
99	0.92	0.73	1.00	0.84	1008	315	117	0
99.5	0.97	0.91	1.00	0.95	1008	391	41	0
Random Forest	90	0.98	1.00	0.94	0.97	978	432	0	30
95	0.99	0.99	0.99	0.99	1004	427	5	4
99	0.92	0.73	1.00	0.84	1008	314	118	0
99.5	0.97	0.88	1.00	0.94	1008	382	50	0
XG Boost	90	0.97	0.97	0.94	0.96	982	420	12	26
95	0.97	0.92	0.98	0.95	1000	398	34	8
99	0.89	0.64	1.00	0.78	1008	275	157	0
99.5	0.90	0.68	1.00	0.81	1008	292	140	0

<Table 6>.

Results of Detecting Outliers of Scenario 4 (10%)

Algorithms	PI(%)	Accuracy	Precision	Recall	F1Score	TN	TP	FN	FP
Cat Boost	90	0.89	1.00	0.73	0.85	851	432	0	157
95	0.99	1.00	0.97	0.98	994	432	0	14
99	1.00	1.00	1.00	1.00	1008	432	0	0
99.5	1.00	1.00	1.00	1.00	1008	432	0	0
LSTM	90	0.94	1.00	0.83	0.91	921	432	0	87
95	0.98	1.00	0.94	0.97	980	432	0	28
99	1.00	1.00	1.00	1.00	1007	432	0	1
99.5	1.00	1.00	1.00	1.00	1008	432	0	0
NG Boost-Ⅱ	90	0.96	1.00	0.89	0.94	954	432	0	54
95	0.98	1.00	0.94	0.97	979	432	0	29
99	1.00	1.00	1.00	1.00	1008	432	0	0
99.5	1.00	1.00	1.00	1.00	1008	432	0	0
Random Forest	90	0.96	1.00	0.88	0.94	949	432	0	59
95	0.98	1.00	0.95	0.97	985	432	0	23
99	1.00	1.00	1.00	1.00	1008	432	0	0
99.5	1.00	1.00	1.00	1.00	1008	432	0	0
XG Boost	90	0.90	1.00	0.74	0.85	859	432	0	149
95	0.99	1.00	0.98	0.99	997	432	0	11
99	0.99	0.95	1.00	0.98	1008	411	21	0
99.5	0.96	0.85	1.00	0.92	1008	369	63	0

<Table 7>.

MAPE of the Proposed Algorithms for 4 Scenarios

Algorithms	PI(%)	SC1(3%)	SC2(5%)	SC3(7%)	SC4(10%)	Overall
Cat Boost	90	0.69	1.23	0.72	0.93	0.89
95	0.67	0.81	0.61	0.34	0.61
99	0.79	1.26	1.15	0.29	0.87
99.5	0.77	1.21	1.30	0.29	0.90
average	0.73	1.13	0.95	0.46	0.82
LSTM	90	0.57	0.55	0.35	0.45	0.48
95	0.56	0.38	0.32	0.31	0.39
99	0.68	0.98	0.59	0.23	0.62
99.5	0.68	0.80	0.52	0.23	0.55
average	0.62	0.68	0.44	0.31	0.51
NG Boost-Ⅱ	90	0.63	1.67	1.99	0.43	1.18
95	0.59	0.45	0.51	0.37	0.48
99	0.69	1.04	0.83	0.29	0.71
99.5	0.69	0.83	0.47	0.29	0.57
average	0.65	1.00	0.95	0.34	0.74
Random Forest	90	0.91	0.39	0.37	0.52	0.55
95	0.62	0.45	0.35	0.39	0.45
99	0.71	1.14	1.04	0.33	0.80
99.5	0.70	1.02	0.59	0.33	0.66
average	0.73	0.75	0.59	0.39	0.62
XGBoost	90	0.68	0.76	0.73	1.21	0.85
95	0.72	0.97	0.77	0.68	0.78
99	0.78	1.25	1.36	0.77	1.04
99.5	0.79	1.19	1.32	1.17	1.12
average	0.74	1.04	1.05	0.96	0.95

Reference

Abdel-Aal, R.E., Hourly temperature forecasting using abductive networks, Engineering Applications of Artificial Intelligence, 2004, Vol. 17, No. 5, pp. 543-556.
Barnett, V. and Lewis, T., Outliers in statistical data, Wiley New York, 1994, pp. 7-33.
Bidlot, J.R., Holmes, D.J., Wittmann, P.A., Lalbeharry, R., and Chen, H.S., Intercomparison of the performance of operational ocean wave forecasting systems with buoy data, Weather and Forecasting, 2002, Vol.17. No.2, pp. 287-310.
Chakraborty, D. and Elzarka, H., Advanced machine learning techniques for building performance simulation: a comparative analysis, Journal of Building Performance Simulation, 2019, Vol. 12, No. 2, pp. 193-207.
Cifuentes, J., Marulanda, G., Bello, A., and Reneses, J., Air temperature forecasting using machine learning techniques: A review, Energies, 2020, Vol. 13, No. 16, pp. 4215.
Dai, X. and Gao, Z., From model, signal to knowledge: A data-driven perspective of fault detection and diagnosis, IEEE Transactions on Industrial Informatics, 2013, Vol.9, No.4, pp. 2226-2238.
Duan, T., Anand, A., Ding, D.Y., Thai, K.K., Basu, S., Ng, A., and Schuler, A., Ngboost: Natural gradient boosting for probabilistic prediction, In International Conference on Machine Learning, 2020, pp. 2690-2700.
Ebuchi, N., Graber, H.C., and Caruso, M.J., Evaluation of wind vectors observed by QuikSCAT/SeaWinds using ocean buoy data, Journal of Atmospheric and Oceanic Technology, 2002, Vol.19, No.12, pp. 2049-2062.
Fang, Z., Crimier, N., Scanu, L., Midelet, A., Alyafi, A., and Delinchant, B., Multi-zone indoor temperature prediction with LSTM-based sequence to sequence model, Energy and Buildings, 2021, Vol. 245, pp. 111053.
Ge, Z., Song, Z., and Gao, F. Review of recent research on data-based process monitoring, Industrial & Engineering Chemistry Research, 2013, Vol. 52, No.10, pp. 3543-3562
Heidemann, J., Ye, W., Wills, J., Syed, A., and Li, Y., Research challenges and applications for underwater sensor networking, In: IEEE Wireless Communications and Networking Conference 2006, WCNC 2006. IEEE, 2006. pp. 228-235.
Hochreiter, S. and Schmidhuber, J., Long short-term memory, Neural Computation, 1997. Vol. 9, No. 8, pp. 1735-1780.
Jeong, S., Park, I., Kim, H.S., Song, C.H., and Kim, H.K., Temperature prediction based on bidirectional long short-term memory and convolutional neural network combining observed and numerical forecast data, Sensors, 2021, Vol. 21, No.3, pp. 941.
Khan, M.I. and Maity, R., Hybrid deep learning approach for multi-step-ahead prediction for daily maximum temperature and heatwaves, Theoretical and Applied Climatology, 2022. Vol. 149, No. 3-4, pp. 945-963.
Leonard, N.E., Paley, D.A., Lekien, F., Sepulchre, R., Fratantoni, D.M., and Davis, R.E., Collective motion, sensor networks, and ocean sampling, Proceedings of the IEEE, 2007, Vol. 95, No. 1, pp. 48-74.
Li, D., Wang, Y., Wang, J., Wang, C., and Duan, Y. Recent advances in sensor fault diagnosis: A review, Sensors and Actuators A: Physical, 2020, Vol. 309, pp. 111990.
Liu, J., Zhang, T., Gou, Y., Wang, X., Li, B., and Guan, W., Convolutional LSTM networks for seawater temperature prediction, 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), 2019, IEEE.
Qin, S.J., Survey on data-driven industrial process monitoring and diagnosis, Annual Reviews in Control, 2012, Vol. 36, No. 2, pp. 220-234.
Reynolds, R.W., A real-time global sea surface temperature analysis, Journal of Climate, 1988, Vol.1, No.3, pp. 75-87.
Venkatesan, R., Ramesh, K., Muthiah, M.A., Thirumurugan, K., and Atmanand, M.A., Analysis of drift characteristic in conductivity and temperature sensors used in Moored buoy system, Ocean Engineering, 2019, Vol. 171, pp. 151-156.
Wang, R., Lu, S., and Li, Q. Multi-criteria comprehensive study on predictive algorithm of hourly heating energy consumption for residential buildings, Sustainable Cities and Society, 2019, Vol. 49, pp. 101623.
Xiang, L., Wang, P., Yang, X., Hu, A., and Su, H. Fault detection of wind turbine based on SCADA data analysis using CNN and LSTM with attention mechanism, Measurement, 2021, Vol. 175, pp. 109094.
Yin, S., Wang, G., and Karimi, H.R., Data-driven design of robust fault detection system for wind turbines, Mechatronics, 2014, Vol. 24, No.4, pp. 298-306.
Zamora-Martinez, F., Romeu, P., Botella-Rocamora, P., and Pardo, J., On-line learning of indoor temperature forecasting models towards energy efficiency, Energy and Buildings, 2014, Vol. 83, pp. 162-172.
Zhang, D., Qian, L., Mao, B., Huang, C., Huang, B. and Si, Y., A data-driven design for fault detection of wind turbines using random forests and XGboost, IEEE Access, 2018, Vol. 6, pp. 21020-21031.