## 1. Introduction

Weather buoys floating near ports serve as warning signs of reefs and other dangerous zones for ships approaching ports. The buoys also emit light to guide ships into and out of ports in bad weather. Weather conditions near the coast change suddenly, and fog, rain, and strong wind are frequent, so the role of buoys is very important. In particular, South Korea, which is considered as a case study in this paper, is a peninsula surrounded by sea on three sides. Hence, there are many ports along the coast, and many buoys are installed near them. In <Figure 1>, red points indicate the locations of buoys installed in the seas around Korea.

Recently, weather buoys have been equipped with Internet of Things (IoT) sensors and wireless communication devices. The sensors measure meteorological data (e.g., air and water temperature, air pressure, humidity, and wind speed) and the buoy’s status (e.g., battery voltage, current, and solar charging). Wireless communication devices then transmit the data at irregular intervals to the central database in a ground control center and to nearby ships. <Figure 2> shows an example of a buoy with sensors and a wireless communication device. Ships use the data provided by the buoys to plan their routes and schedules, so the accuracy of the data is essential to fulfill the purpose of the buoy as a navigational aid.

The sensors considered in this study observe air and water temperature, air pressure, humidity, and wind speed, and transmit them wirelessly together with the data collection time. Meteorological data usually shows complex patterns with daily variation and seasonality, and the weather at sea changes more suddenly than the weather on land. In addition, weather buoys are not stationary but float on the sea, are constantly exposed to the outdoor environment, and are far from the ground control center. For these reasons, transmission failures occur often and abnormal data is transmitted frequently. According to Li et al. [16], such erroneous sensor data can be divided into two types: incipient failure and abrupt failure. The former causes null data, while the latter produces abnormal values (outliers). These errors make it difficult for users to properly utilize the collected data, so it is important to detect and correct errors such as nulls and outliers.

This study focuses on the quality of the meteorological data transmitted from weather buoys. Ships receive the meteorological data from nearby buoys and use it to plan routes and schedules and to check safety. However, as mentioned, the sensors of the buoys occasionally experience incipient and abrupt failures. When ships receive data containing errors such as nulls and outliers, or do not receive any data, they request normal data from a ground control center. Therefore, errors in the meteorological data at the ground control center must be detected and interpolated in real time so that reliable data is available whenever ships request it. To achieve high reliability, the buoy’s data must be cleansed so that it contains no errors.

To improve the quality of meteorological data from weather buoys, this study develops a framework of methods to detect and interpolate erroneous data, so that ships at sea can be provided with reliable data immediately after the data is received. The framework is based on machine learning models for data preprocessing, error detection, and interpolation. After the machine learning models are trained with historical meteorological data, the framework can detect and interpolate errors as soon as new data arrives. Also, as Barnett and Lewis [2] noted, several kinds of algorithms may be needed to remove faulty data and interpolate it with newly forecast values. Therefore, we develop a new framework that integrates appropriate machine learning algorithms to improve the quality of air temperature data over the sea.

This paper is organized as follows. Section 2 introduces related studies and discusses their limitations. Section 3 describes the methods performed sequentially in the proposed framework, and Section 4 validates the performance of the suggested framework using data collected from buoys in the seas around Korea. Finally, Section 5 presents the conclusions and directions for further research.

## 2. Related Studies and Limitations

Buoys have been used for decades, so there are many studies on the use of buoy data for various objectives. Among them, Reynolds [19] suggested a platform for monitoring sea surface temperature using satellites. Buoy data can be used as a comparison group to check the quality of data from satellites [8] or from several types of forecasting systems [3]; in these cases, the buoy data itself is used as the reference. Heidemann et al. [11] introduced research challenges related to underwater sensor networking, while [15] focused on the collective motion of an ocean sensor network. Among these studies, Venkatesan et al. [20] focused on the drift characteristics of buoy temperature sensors, which implies that if the data is not calibrated back to normal values, its quality may continue to deteriorate.

Among the many related studies, the problem we consider is most closely related to data fault detection. Barnett and Lewis [2] defined *outliers* as observations (or subsets of observations) that appear to be inconsistent with the rest of the data in a given dataset. Qin [18], Ge et al. [10], and Dai and Gao [6] surveyed studies on fault detection and diagnosis in the monitoring of industrial processes, covering aspects such as safety, quality, and operational efficiency. For other applications, Yin et al. [23], Zhang et al. [25], and Xiang et al. [22] developed a convolutional neural network, XGBoost, and LSTM models, respectively, for fault detection in wind turbine data.

The problem we consider also requires replacing faulty data with normal values, so a forecasting algorithm is needed as well. There have been many studies on data forecasting, so we focus on temperature forecasting research. Abdel-Aal [1] proposed an alternative abductive networks approach to forecast next-day and next-hour temperatures. Zamora-Martinez et al. [24] proposed an online learning model for forecasting indoor temperatures. Cifuentes et al. [5] reviewed studies on air and water temperature forecasting using machine learning techniques, from short-term (daily and hourly) forecasting to long-term forecasting of global temperatures over decades.

Although there are many studies on data cleansing, fault detection, and forecasting, it is important to integrate these methodologies to find faulty data and interpolate it. This study contributes to the literature by developing a framework that performs both fault detection and interpolation with different machine learning models, and by identifying errors and interpolating them with appropriate values in real time as data is collected.

## 3. Models for Data Quality Improvement

This section describes the considered problem in more detail and proposes a framework to improve data quality. This study considers meteorological data transmitted from the sensors of weather buoys. Every weather buoy in Korea collects air temperature, air pressure, humidity, and wind speed by default. Among these, this study focuses on air temperature because it is the most effective factor for sensing changes in weather at sea: air temperature is the most important indicator of changes in atmospheric circulation and can represent the meteorological conditions of the sea.

Generally, temperature data recorded successively on land shows clear trends and small variations. However, because weather buoys float on the sea and communicate wirelessly with a distant ground control center or with ships, the variation is larger than on land and the intervals between time stamps in the database are not regular. Additionally, the data includes erroneous values such as outliers and nulls, where an outlier is an observation that lies far from the center of a given data set, and null data represents the absence of any recorded value. Thus, this study aims to develop a framework to improve the quality of the air temperature data from weather buoys. <Figure 3> shows a flow chart of the proposed framework, with several models to detect erroneous values and interpolate them.

First, the ‘Data Collection’ step collects and stores the data transmitted from the weather buoys. Since this study assumes the data is already given, the procedure of the framework starts at the second step, ‘Data Check’, which makes the irregular intervals between successive time stamps regular. By scanning the data and counting the frequencies of the time intervals, it finds the most frequent interval between two successive time stamps, called the normal interval. Then, if there are extra time stamps within a normal interval, they are removed. Conversely, if an interval between two time stamps is longer than the normal interval, null data is inserted between them to make up for the time gap. After ‘Data Check’, ‘Data Preparation’ for machine learning is performed: additional input values such as the mean, standard deviation, and min/max of each column are generated alongside the existing column values. The next step, ‘Fault Detection’, detects outliers (excluding nulls) using two methods: the interquartile range method and NGBoost. The detected outliers are then replaced with nulls in the ‘Fault Data Removal’ step. Finally, ‘Data Interpolation’ replaces the nulls with proper values using machine learning models: CatBoost, XGBoost, Random Forest, NGBoost, and long short-term memory (LSTM). The model used depends on the type of error: if only the temperature data (the target) is removed or missing, the CatBoost, XGBoost, Random Forest, and NGBoost algorithms can interpolate the target value using other data such as humidity and air pressure as input. Note that, unlike the ‘Fault Detection’ step in the flow chart, we do not specify a single algorithm in the ‘Data Interpolation’ step because only one of the proposed algorithms is used in the framework.
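As a minimal illustration, the ‘Data Check’ step can be sketched as follows. The pandas-based implementation and the column names (`obs_time`, `AIR_TEMPERATURE`) are illustrative assumptions, not the exact implementation used in the framework.

```python
import pandas as pd

def data_check(df: pd.DataFrame, time_col: str = "obs_time") -> pd.DataFrame:
    """Regularize time stamps to the most frequent ('normal') interval."""
    df = df.sort_values(time_col).reset_index(drop=True)
    gaps = df[time_col].diff().dropna()
    normal = gaps.mode().iloc[0]  # most frequent interval between stamps
    # Snap to a regular grid: extra stamps inside a normal interval are
    # dropped, and missing stamps appear as rows of null (NaN) values.
    grid = pd.date_range(df[time_col].iloc[0], df[time_col].iloc[-1], freq=normal)
    return (df.set_index(time_col)
              .reindex(grid)          # inserts nulls for time gaps
              .rename_axis(time_col)
              .reset_index())
```

For example, a minute-level series missing one stamp would gain one row of nulls at the missing time, which the later ‘Data Interpolation’ step then fills.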
The following subsections describe the suggested models in detail: the interquartile range and NGBoost-I for detecting outliers, and LSTM, CatBoost, XGBoost, Random Forest, and NGBoost-II for interpolation.

### 3.1 Detecting outliers

We use two outlier detection algorithms: the interquartile range (IQR) and NGBoost-I. These algorithms run sequentially.

#### 3.1.1 Interquartile range (IQR)

The interquartile range (IQR) is a robust statistical method that measures the spread of a dataset by focusing on its middle 50%. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). Quartiles divide a dataset into four equal parts, with Q1 representing the 25th percentile and Q3 the 75th percentile. In this model, Q1 − 1.5×IQR is used as a lower bound and Q3 + 1.5×IQR as an upper bound. By discarding extreme values, the IQR provides a reliable measure of variability, particularly in the presence of outliers or skewed data. In our observed sensor data, there are instances where values are not null but extreme values are recorded due to sensor errors. There are also cases where a sensor fails and periodically outputs a pre-defined default value. In such scenarios, the IQR proves to be an effective filter. In this study, we apply the IQR method using the Q1 and Q3 values of the data from the past 7 days, and values below the lower bound or above the upper bound are detected as outliers.
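The IQR rule above can be sketched as a short function; here the reference window stands in for the past 7 days of observations.

```python
import numpy as np

def iqr_outliers(values, window):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] of a reference window.

    `window` plays the role of the past 7 days of readings; `values`
    are the new readings to screen.
    """
    q1, q3 = np.percentile(window, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    values = np.asarray(values, dtype=float)
    return (values < lower) | (values > upper)
```

A stuck-at-zero default value or an extreme spike falls far outside the bounds and is flagged, while readings near the recent median pass through.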

#### 3.1.2 NGBoost-I

The Natural Gradient Boosting (NGBoost) algorithm, suggested by Duan et al. [7], combines the principles of gradient boosting with a probabilistic framework. It is designed to perform probabilistic regression and classification while also providing uncertainty estimates for its predictions. Traditional gradient boosting algorithms, such as the popular XGBoost and LightGBM, focus on minimizing the mean squared error or other point-wise loss functions to improve predictive accuracy. NGBoost, on the other hand, optimizes a probabilistic loss function, so it provides not just point predictions but a model of the entire predictive distribution.

The NGBoost algorithm consists of three components: base learners, a probability distribution, and a scoring rule. The base learner is typically a decision tree, while the probability distribution component models the output with, for example, a normal or Bernoulli distribution. Since air temperature takes continuous values, the normal distribution is used here. The scoring rule then evaluates the quality of the predicted distribution.

Because NGBoost yields a distribution over prediction values, it is well suited to finding outliers: it gives not only a prediction but also upper and lower bounds. If an observed value is below the lower bound or above the upper bound, it is detected as an outlier. The gap between the bounds can be adjusted by controlling the prediction interval. NGBoost-I, which we propose for this purpose, uses the 19 types of input data presented in <Table 1>.
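The bound check itself can be illustrated as follows. This is a sketch of the decision rule only: it assumes a normal predictive distribution whose mean and standard deviation would, in the actual framework, come from NGBoost-I for the current time stamp.

```python
from scipy.stats import norm

def pi_outlier(y, mu, sigma, pi=0.95):
    """Return True if observation y falls outside the central `pi`
    prediction interval of a Normal(mu, sigma) predictive distribution.

    In the framework, mu and sigma are produced by NGBoost-I; here they
    are passed in directly for illustration.
    """
    lower, upper = norm.interval(pi, loc=mu, scale=sigma)
    return bool(y < lower or y > upper)
```

Widening the prediction interval (e.g., from 90% to 99.5%) widens the bounds, so fewer observations are flagged as outliers.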

### 3.2 Data Interpolation

After an outlier is detected, the interpolation process must replace the resulting null values. The interpolation requires the latest 30 observations, but if outlier values are used as they are, or null values remain, the accuracy of the NGBoost model inevitably decreases. Therefore, an interpolated value must be substituted immediately after an error is detected. The temperature data we need to interpolate exhibits a recurring daily trend, making accurate interpolation challenging with basic techniques such as moving averages or exponential smoothing. We use another NGBoost model (NGBoost-Ⅱ), long short-term memory (LSTM), XGBoost, Random Forest, and CatBoost for interpolation and compare their performance.

<Table 2> shows the input values of each model for interpolation. Note that NGBoost-Ⅱ uses fewer input parameters than NGBoost-I. The reason is that the purpose of NGBoost-I is to detect outliers, while NGBoost-Ⅱ focuses on finding appropriate interpolation values. In our pre-test, the performance of the boosting algorithms increased when we removed 12 input columns (the average, standard deviation, minimum, and maximum of HUMIDITY, WIND_SPEED, and AIR_PRESSURE). If NGBoost-I did not consider all input values of a sensor, it might detect a normal value as an outlier. Therefore, NGBoost-I gives robust bounds (a wide prediction interval), while NGBoost-Ⅱ gives proper values with a narrow prediction interval.

Also, since Hochreiter and Schmidhuber [12] introduced long short-term memory (LSTM), many studies have shown that LSTM-based algorithms forecast temperature data well compared to existing models [9,13,14,17]. Therefore, we also use LSTM to forecast the values that were detected as erroneous and removed in the preceding steps. The LSTM uses only the latest 30 AIR_TEMPERATURE values to predict the next one.
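The windowing that feeds the LSTM can be sketched as follows; this shows only the data preparation (each sample is the previous 30 temperatures, the target is the next one), not the network itself, and the function name is illustrative.

```python
import numpy as np

def make_windows(series, lookback=30):
    """Build (X, y) pairs for the LSTM: each sample holds the previous
    `lookback` temperatures and the target is the value that follows."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + lookback]
                  for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X[..., None], y  # shape (samples, lookback, 1) for an LSTM layer
```

The trailing singleton dimension matches the `(timesteps, features)` input shape that typical LSTM layers expect for a univariate series.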

We also employ Random Forest, CatBoost, and XGBoost, which fall under the ensemble learning methodology. Their fundamental concept is to combine multiple basic models, known as weak learners, into a robust model with enhanced predictive accuracy. These approaches have emerged as a promising alternative in fields such as building energy efficiency, with studies demonstrating their efficacy in forecasting energy consumption [21] and in predictive energy models [4].

## 4. Computational Experiments

This section reports the evaluation of the suggested models: IQR and NGBoost-I for detecting outliers, and NGBoost-II, XGBoost, CatBoost, Random Forest, and LSTM for interpolation. For the evaluation, we used the ‘AIR_TEMPERATURE’ data of August 30, 2022. The data is transmitted every minute, so there are 1,440 values in a day. Among them, 30% were randomly selected and replaced with three types of outliers (zero, large, and small values), 10% each. The large and small outliers were generated by increasing or decreasing the original values at one of four rates (3%, 5%, 7%, and 10%), yielding 4 scenarios in total. <Figure 4> gives examples of the generated data based on the real-world case of a marine buoy. In <Figure 4>, the x-axis represents time and the y-axis represents temperature; red marks are the generated outliers and blue dots are the original (normal) ‘AIR_TEMPERATURE’ data.
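The scenario generation described above can be sketched as follows. The equal three-way split, the fixed seed, and the function name are our illustrative assumptions.

```python
import numpy as np

def inject_outliers(temps, rate=0.03, frac=0.30, seed=0):
    """Corrupt a fraction of readings with three outlier types in equal
    shares: zeros, values increased by `rate`, and values decreased by
    `rate` (rate=0.03 corresponds to scenario 1)."""
    rng = np.random.default_rng(seed)
    temps = np.asarray(temps, dtype=float).copy()
    idx = rng.choice(len(temps), size=int(frac * len(temps)), replace=False)
    third = len(idx) // 3
    temps[idx[:third]] = 0.0                   # sensor-default / zero outliers
    temps[idx[third:2 * third]] *= (1 + rate)  # "large" outliers
    temps[idx[2 * third:]] *= (1 - rate)       # "small" outliers
    return temps, idx
```

With 1,440 one-minute readings, this corrupts 432 values (30%), 144 of each type.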

As mentioned in the previous section, the NGBoost algorithm provides different upper and lower bounds depending on the width of the prediction interval. Therefore, we test the accuracy of outlier detection with prediction intervals of 90%, 95%, 99%, and 99.5%. Also, since interpolated values are used for the next outlier detection, the detection results may be affected by the interpolation model. Hence, we evaluate each combination of prediction interval (PI) and interpolation algorithm for both fault detection and data interpolation, rather than validating each separately. We tested 4 levels of PI (90%, 95%, 99%, and 99.5%) and 5 interpolation algorithms (CatBoost, NGBoost, XGBoost, Random Forest, and LSTM) in each of the 4 scenarios, giving 80 combinations in total.

In our framework, it is important that outlier detection works together with the interpolation step. If an outlier is not detected and removed properly, it is mistaken for a normal value when predicting the next value, resulting in a very large interpolation error. Therefore, the width of the NGBoost-I prediction interval, which determines what counts as an outlier, is important, and detection performance must be compared across interval widths. For this purpose, we compare accuracy, precision, and recall. The results of the four scenarios for each prediction interval are summarized in <Table 3> ~ <Table 6>.

<Table 3> ~ <Table 6> show the accuracy, precision, recall, and F1-score of all combinations. Accuracy is the ratio of the number of correct predictions to the total number of fault data that we generated; if the accuracy is 1.00, all outlier data are detected. True Positives (TP) are injected outliers that the model judges to be outliers, and True Negatives (TN) are normal values that the model judges to be normal. False Positives (FP) are normal values wrongly flagged as outliers, and False Negatives (FN) are injected outliers that the model misses. These counts are the basis for evaluating classification performance: precision measures the proportion of correct positive predictions out of all predicted positives, and recall measures the proportion of actual positives that the model identifies.
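The metrics follow directly from the four counts; a short helper makes the definitions concrete (here ‘positive’ means an injected outlier).

```python
def classification_scores(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from a confusion matrix,
    where a 'positive' is an injected outlier."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # flagged points that really were outliers
    recall = tp / (tp + fn)      # outliers that were actually caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, a detector that catches every outlier but also flags 10 normal points has perfect recall but reduced precision.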

As shown in <Table 3>, the accuracy and precision at the 90% PI are better than at other PI values in scenario 1. Scenario 1 generated outliers with a 3% increase or decrease, meaning that normal and abnormal (outlier) data are the closest among the 4 scenarios. With a 90% prediction interval, NGBoost-Ⅰ has the narrowest gap between the upper and lower bounds around normal data points, so it is effective at filtering out anomalies that lie close to normal values. However, the False Positive (FP) counts are also highest at 90%: because of the narrow prediction intervals, normal data points are often misclassified as anomalies. Across all algorithms, a consistent pattern emerges, with NGBoost-I at a 90% PI combined with XGBoost yielding the most promising results. On the other hand, at PIs of 99% and 99.5%, most normal data points are correctly identified, although many anomalies go undetected.

Scenario 2 (5%) in <Table 4> exhibits higher accuracy and precision than scenario 1. Unlike scenario 1, scenario 2 performs best at a 95% PI. Among the algorithms in this scenario, LSTM and Random Forest achieve the highest accuracy of 0.97, with LSTM displaying the highest precision. <Table 5> and <Table 6> present the results of scenarios 3 and 4, respectively. These scenarios show improved performance because anomalies and normal data points are more clearly separated than in scenarios 1 and 2. In scenario 3, most algorithms except NGBoost-II perform well at PIs of 90% and 95%. However, as the PI increases to 99% and 99.5%, the number of False Negatives (FN) rises substantially, adversely affecting model recall. In <Table 6>, scenario 4, with the largest distinction between anomalies and normal data points, shows that PIs of 99% and 99.5% filter out almost all anomalies, except for XGBoost, which still misses some at these PIs.

<Table 7> shows the mean absolute percentage error (MAPE) of all combinations of 4 scenarios. Note that in <Table 7>, SC1 (3%) refers to a scenario in which outliers are generated by increasing or decreasing the original data by 3%.
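For reference, the MAPE reported in <Table 7> is computed as the mean absolute percentage deviation of the interpolated values from the originals:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error (in %), as reported in <Table 7>."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))
```

A MAPE of 0.51% thus means the interpolated temperatures deviate from the true values by about half a percent on average.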

LSTM shows the best performance in 15 of 16 cases; Random Forest gives the best result in SC2 with a 90% PI. Comparing average values, LSTM gives the best result with a MAPE of 0.51%, followed by Random Forest with 0.62% and then the three boosting algorithms. These results indicate that LSTM is a suitable methodology for predicting temperature data and that recent temperature values alone are sufficient to predict the next one. In particular, because the data collection interval is short, successive temperature readings from a weather buoy are often similar, so LSTM, which reflects the characteristics of time series data well, is especially effective.

Based on the accuracy and MAPE results, when the deviation of anomalies exceeds 7%, PIs of 99% or 99.5% should be employed; below this threshold, a PI of 95% is optimal for revealing errors. Considering that anomaly gaps in real-world data are irregularly distributed and that some anomalous data points closely resemble true values, a 95% PI is deemed appropriate overall. Although Random Forest and LSTM filter anomalies with similar accuracy, the MAPE results suggest that LSTM is better suited for interpolation. Hence, the proposed approach is to use the NGBoost-I algorithm with a 95% PI for anomaly detection, followed by LSTM to replace the detected anomalies with interpolated values.

We tested NGBoost-I with a 95% PI and LSTM on a real dataset from August 31, 2022. In <Figure 5>, the purple band represents the 95% prediction interval of NGBoost-I. The blue dots in the red circles outside the prediction interval are judged to be outliers and are replaced by the interpolated values shown as orange dots. On this real-world data, our model produced reasonable results, with a MAPE of 0.17%.

## 5. Conclusion

In this study, we developed a framework to improve the quality of data transmitted from a weather buoy. The buoy sends several types of data via IoT sensors, and the data contains many null and faulty values. To improve data quality, we detect missing and faulty data in the data check and fault detection steps using the lower and upper bounds obtained from the NGBoost-I algorithm. The detected values are changed to nulls, and the nulls are interpolated by one of 5 machine learning algorithms. The performance of the proposed framework was verified through 4 test scenarios based on an actual buoy in Korea, and the combination of NGBoost-I with a 95% PI and LSTM gave the best results.

Through this study, we confirmed that the suggested framework can be applied to several types of buoy sensor data by using historical data. If multiple data sets from several buoys are considered instead of a single buoy, the performance of the improvement processes is expected to increase. However, this study only considered a period of data within 2 months; over longer periods, additional factors such as seasonality may need to be considered. Also, because the data comes from IoT sensors, data may be missing over a long period, in which case the framework may need to be extended to reflect various real-world situations.

To improve the results of the suggested framework, future research could consider several cases that can occur in sensor data. For instance, if all columns of data gathered at the same time are outliers, the suggested framework may give inappropriate interpolation results. In addition, if the error gradually increases (drift), the framework may recognize continuously defective data as normal and fail to correct it. Therefore, a feedback process may be added to ensure stability.