1. Introduction
Accurate forecasting of inbound foreign tourist arrivals is a critical component of tourism planning and economic policy, particularly in countries like South Korea, where international tourism significantly contributes to national revenue and cross-cultural exchange. However, generating reliable forecasts remains a challenging task due to the multifaceted nature of tourism demand, which is shaped by both temporal demand patterns and a wide range of exogenous influences. These include seasonal cycles, macroeconomic conditions, global events, and climate variations. Such factors introduce nonlinearity, abrupt disruptions, and contextual dependencies that traditional forecasting models struggle to accommodate [30].
Conventional statistical models such as ARIMA and SARIMA have been widely used for tourism demand forecasting [14, 25], but they tend to underperform in capturing nonlinear dynamics or in leveraging external variables that impact future demand [22]. While machine learning approaches like support vector machines and random forests improve predictive power by incorporating exogenous features [2, 23], they often lack the ability to model long-term dependencies or to integrate heterogeneous time-series inputs in a unified forecasting framework.
In the domain of tourism demand forecasting, deep learning models such as Long Short-Term Memory (LSTM) networks and Transformer-based architectures have recently demonstrated strong potential for modeling sequential and multivariate data through recurrent and attention-based mechanisms [2, 11]. By leveraging their ability to capture long-term dependencies and nonlinear dynamics, these models have improved forecasting accuracy in tourism settings characterized by high variability and contextual complexity. Nonetheless, when temporal and contextual features are processed within a single encoder, their representations become entangled, which may reduce robustness, particularly during structural shocks such as the COVID-19 pandemic.
To address this limitation, recent studies in other domains have proposed hybrid deep learning models that combine LSTM and Transformer modules to jointly model sequential dependencies and contextual signals [1, 32]. However, these models typically adopt a serial or stacked configuration, where both types of inputs are concatenated into a unified stream. This design produces intertwined feature representations that obscure the distinct contributions of temporal and exogenous variables, limiting generalization and hindering insight extraction under volatile conditions.
To overcome these challenges, this study proposes the Skip-Connected Temporal Contextual Deep Learning (SC-TCDL) model, designed to disentangle historical demand and future exogenous drivers. The architecture comprises two parallel branches: an LSTM branch for capturing temporal dynamics and a Transformer branch for encoding contextual information. The temporal summary produced by the LSTM is injected into the Transformer through skip connections, enriching its contextual representations. These enhanced contextual features are subsequently concatenated with the LSTM summary and passed to a multi-layer perceptron (MLP) to generate 12-month ahead forecasts. By preserving the independence of temporal and contextual signals while enabling their complementary integration, the SC-TCDL model ultimately enhances forecasting performance.
In this architecture, the LSTM branch encodes sequential patterns in past tourist arrivals and calendar-based indicators, such as seasonality and holiday effects. The Transformer branch, in turn, models forward-looking exogenous variables, such as exchange rates, temperatures, global event flags, and extreme weather indicators, through attention mechanisms and positional encodings, capturing their interdependencies across the forecast horizon.
Various approaches have been proposed to address distorted demand data during the COVID-19 pandemic. While earlier studies typically relied on excluding pandemic-era data or applying simple imputations [9], more recent research has shifted toward model-based restoration approaches [4]. Building on this direction, the present study proposes a virtual demand restoration strategy that reconstructs the disrupted period (March 2020 to December 2023) using counterfactual estimates generated by an LSTM model trained exclusively on pre-pandemic data. This approach enables the model to learn stable seasonal patterns unaffected by pandemic-induced distortions while maintaining continuity in the input sequences.
The SC-TCDL model is trained and evaluated using monthly inbound tourist arrival data for South Korea. Empirical results show that the proposed model outperforms both classical statistical models and recent deep learning baselines in terms of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). Its disentangled and skip-connected architecture enhances predictive performance and contributes to more stable learning outcomes under complex and volatile conditions.
This study contributes to the literature in two main ways. First, it introduces a novel skip-connected hybrid architecture that clearly separates and selectively integrates temporal and contextual learning streams, thereby addressing the issue of feature entanglement in multivariate forecasting. Second, it proposes a pragmatic strategy for handling pandemic-related distortions through virtual demand restoration, enabling the model to learn more generalizable seasonal patterns from data disrupted by the pandemic period.
The remainder of this paper is organized as follows. Section 2 reviews related work on deep learning-based tourism demand forecasting, with a focus on hybrid modeling approaches, and provides overviews of foundational architectures such as LSTM and Transformer models. Section 3 introduces the proposed SC-TCDL model, detailing its disentangled dual-branch architecture and skip-connected fusion mechanism. Section 4 outlines the experimental design, including data preprocessing, pandemic-period demand restoration, and forecasting configuration. Section 5 presents the empirical results, encompassing benchmark comparisons, ablation analyses, contextual variable contribution analysis, and uncertainty visualizations. Finally, Section 6 concludes with a summary of contributions, practical forecasting implications, and directions for future research.
2. Related Work
2.1 Hybrid Deep Learning Models for Tourism Demand Forecasting
Tourism demand forecasting is a challenging task because it requires modeling nonlinear temporal dependencies while integrating a diverse set of contextual variables that influence future arrivals. Deep learning models, particularly those based on LSTM and Transformer architectures, have been increasingly applied to this task. Given their distinct modeling capabilities, recent research has shown growing interest in hybrid architectures that leverage the complementary strengths of both.
LSTM networks are well-suited for capturing short- to medium-term temporal dependencies and are especially effective in modeling seasonality, holiday effects, and disruptions in demand patterns. For example, Li and Cao [13] demonstrated that LSTM outperforms conventional methods like ARIMA and backpropagation neural networks in short-term tourism forecasting. Salamanis et al. [21] further improved hotel reservation forecasts by integrating weather-related contextual variables into the LSTM model. Zhang et al. [30] developed an LSTM-based deep learning model that integrates exchange rates, oil prices, stock indices, and COVID-19 cases to improve the accuracy of daily international tourist arrival forecasts at Incheon International Airport in South Korea. These studies collectively demonstrate the effectiveness of LSTM models in learning endogenous demand cycles and their robustness when supported by relevant exogenous features.
In contrast, Transformer-based models excel at learning long-range temporal dependencies and modeling complex interactions among multiple contextual variables. These capabilities are particularly beneficial in tourism forecasting, where demand is affected by a broad range of forward-looking factors, including macroeconomic indicators, weather conditions, and global events. Li et al. [12] enhanced tourism demand prediction accuracy by combining Transformer encoders with robust time-series decomposition and hyperparameter optimization. Wu et al. [28] improved model interpretability and forecasting performance under volatile tourism conditions by introducing an attention-based temporal fusion framework. These studies highlight the capacity of Transformer models to effectively encode and integrate multivariate contextual information in tourism forecasting scenarios.
In the tourism domain, recent advances have introduced hybrid architectures that combine multiple deep learning components to improve forecasting performance. He et al. [6] developed a SARIMA–CNN–LSTM model that captures both linear and nonlinear dynamics in high-frequency tourism data. Lu et al. [15] introduced a GA–CNN–LSTM framework that integrates feature extraction with sequence modeling, demonstrating improved accuracy in daily tourist flow prediction. Houria et al. [10] enhanced forecasting performance by incorporating search query data as exogenous contextual input, which was encoded using autoencoders and fed into stacked LSTM layers. Luo et al. [16] presented a CNN–BiLSTM–Attention model that jointly modeled spatial and temporal dependencies to improve predictive accuracy in a scenic region of China. Zhang et al. [31] proposed a BiLSTM–Transformer hybrid model in which both modules are used to model temporal dependencies, BiLSTM for short- and mid-term patterns and Transformer for long-range trends, without explicitly separating contextual information. Similarly, Nguyen-Da et al. [19] utilized a CNN–LSTM architecture in which pandemic-related variables, such as COVID-19 case counts, were extracted through convolutional layers and passed to LSTM for sequential modeling. Building on these developments, the present study introduces a skip-connected dual-branch hybrid model that structurally disentangles temporal and contextual dependencies by assigning LSTM to endogenous demand cycles and Transformer encoders to forward-looking exogenous variables.
2.2 The Mechanics of the LSTM Network
LSTM is a specialized recurrent neural network (RNN) architecture developed to effectively capture extended dependencies in sequential data. Traditional RNNs often face the vanishing gradient problem, which hampers the learning of relationships across distant time steps. LSTM networks address this limitation through the use of a dedicated memory cell and a set of gating mechanisms that regulate information flow over time [8].
<Figure 1> illustrates the internal structure of a typical LSTM cell. At each time step t, the LSTM receives the current input vector X_t, the previous hidden state H_{t-1}, and the previous cell state C_{t-1}. These inputs are processed through a series of gating operations:
Forget Gate: F_t = σ(W_F X_t + U_F H_{t-1} + b_F)
Input Gate: I_t = σ(W_I X_t + U_I H_{t-1} + b_I), C̃_t = tanh(W_C X_t + U_C H_{t-1} + b_C)
Cell State Update: C_t = F_t ⊙ C_{t-1} + I_t ⊙ C̃_t
Output Gate and Hidden State: O_t = σ(W_O X_t + U_O H_{t-1} + b_O), H_t = O_t ⊙ tanh(C_t)
In these expressions, σ is the sigmoid function, tanh denotes the hyperbolic tangent, and ⊙ is the element-wise product. The weight matrices W and U control the transformations from current inputs and previous states, respectively, and b denotes the corresponding bias vectors.
As shown in <Figure 1>, the gating structure enables the network to control the flow of information, deciding what to forget, what new information to incorporate, and what to output. This design allows the LSTM to retain relevant information over extended time intervals, making it highly suitable for time-series applications with pronounced seasonality and external shocks.
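For illustration, the gating operations above can be traced in a minimal pure-Python sketch of a single LSTM step with scalar states. The weights here are illustrative placeholders, not learned parameters, and the scalar form stands in for the vector-valued computation used in practice:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w, u, b):
    # w, u, b are dicts of scalar weights for the four gates f, i, c, o.
    f_t = sigmoid(w["f"] * x_t + u["f"] * h_prev + b["f"])        # forget gate
    i_t = sigmoid(w["i"] * x_t + u["i"] * h_prev + b["i"])        # input gate
    c_tilde = math.tanh(w["c"] * x_t + u["c"] * h_prev + b["c"])  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                            # cell state update
    o_t = sigmoid(w["o"] * x_t + u["o"] * h_prev + b["o"])        # output gate
    h_t = o_t * math.tanh(c_t)                                    # hidden state
    return h_t, c_t

# With all weights zero, each gate evaluates to sigmoid(0) = 0.5 and the
# candidate state is tanh(0) = 0, so the new cell state is half the old one.
zeros = {k: 0.0 for k in "fico"}
h, c = lstm_step(1.0, 0.0, 2.0, zeros, zeros, zeros)  # c becomes 0.5 * 2.0 = 1.0
```

The example makes the forget gate's role concrete: even with no new input contribution, the cell state decays by exactly the forget-gate activation.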
In practical implementations, an LSTM-based forecasting model consists of an input layer, one or more stacked LSTM layers, and a final dense output layer. These models have been widely applied in various domains including speech recognition, energy demand forecasting, financial time-series prediction, and tourism analytics [26, 29].
2.3 The Transformer Model
The Transformer model represents a significant advancement in sequential data processing, offering a parallelized and attention-driven alternative to conventional RNNs. Unlike RNN-based approaches such as LSTM networks, which process input sequences one step at a time and thus struggle to retain long-term dependencies, the Transformer architecture simultaneously attends to all positions in a sequence through a self-attention mechanism. This design not only improves computational efficiency but also allows for the modeling of complex, long-range relationships across input features [24].
At the core of this model is the multi-head self-attention mechanism, which enables the Transformer to assess the relative importance of each input position by computing multiple attention scores in parallel. Each attention head focuses on a distinct subspace of the input representation, allowing the model to capture diverse temporal patterns and interactions among variables. As shown in <Figure 2>, this mechanism distinguishes the Transformer from recurrence-based models, particularly in its ability to model long-distance dependencies without iterative computation.
Because the Transformer does not inherently encode sequence order, it relies on positional encodings to capture the relative positions of input elements. These encodings, which can be deterministic or learned, are added to the input embeddings to preserve temporal structure. Without them, the model would treat all input positions as unordered, thereby losing essential time-series characteristics.
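A deterministic (sinusoidal) positional encoding of the kind described above can be sketched in a few lines of pure Python; the sequence length of 12 in the example is chosen only to mirror a monthly forecast horizon and is otherwise arbitrary:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each dimension pair shares a frequency that decays geometrically.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(12, 8)  # e.g., a 12-step horizon with d_model = 8
# pe[0] alternates 0 and 1, since sin(0) = 0 and cos(0) = 1.
```

These rows are added element-wise to the input embeddings, giving every forecast step a distinct, order-dependent signature.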
A Transformer model typically comprises a stack of encoder blocks, each with two major components: a multi-head attention layer and a position-wise feedforward network. The attention layer allows the model to weight and combine inputs across time steps and variables, while the feedforward network applies nonlinear transformations independently to each position in the sequence. These components are integrated using residual connections and layer normalization, which help stabilize gradient flow and support deep network architectures [7].
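The attention computation at the heart of each encoder block can be illustrated with a single-head, pure-Python sketch of scaled dot-product attention over small toy vectors (multi-head attention runs several such computations in parallel over projected subspaces):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over lists of query, key, and value vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # attention weights over all positions
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# With identical keys the weights are uniform, so the output is the mean of V.
Q = [[1.0, 0.0]]
K = [[1.0, 1.0], [1.0, 1.0]]
V = [[2.0, 0.0], [4.0, 0.0]]
out = scaled_dot_product_attention(Q, K, V)  # → [[3.0, 0.0]]
```

Because every query attends to every position in one pass, no recurrence is needed to relate distant time steps.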
Before being passed to the encoder stack, raw input features are projected into a high-dimensional vector space via an embedding layer. Positional encodings are then added to these embeddings, ensuring that the model can distinguish temporal locations within the input. The encoder processes this enriched input and outputs a sequence of context-aware embeddings that reflect both the temporal progression and cross-feature relationships in the data.
In this study, only the encoder component of the Transformer is adopted, in line with the goal of forecasting structured outputs from forward-looking multivariate inputs. Since there is no requirement for autoregressive generation, which is typically necessary in language modeling, the decoder stage is omitted. Encoder-only Transformer structures have been widely used in time-series forecasting due to their ability to extract global dependencies while maintaining parallel computation [20].
3. Proposed Model Architecture
3.1 Disentangled Input Streams
To implement temporal–contextual disentanglement, all input variables are categorized according to their temporal alignment and functional role within the forecasting model. The model distinguishes between two input domains: temporal variables derived from past observations and contextual variables that represent future known or projected conditions. This structural division forms the basis of a dual-stream architecture, composed of two functionally independent yet jointly optimized modules—an LSTM-based temporal encoder and a Transformer-based contextual processor.
The temporal input stream, processed by the LSTM branch, receives a fixed-length sequence of past foreign tourist arrivals together with holiday indicators and calendar event indicators. These variables represent historical endogenous dynamics up to time t. The LSTM stack extracts recurrent seasonal patterns and medium-term temporal dependencies. The final hidden state of the LSTM encoder is denoted as h_t ∈ ℝ^{d_h}, where d_h denotes the dimensionality of the LSTM hidden representation. This vector serves as the temporal summary vector, which compresses historical demand information.
In contrast, the contextual input stream, processed by the Transformer branch, comprises exogenous variables from time t+1 onward. These include exchange rates, temperature forecasts, irregular event indicators, and extreme weather indicators. Rather than merging these variables with historical inputs at the raw-input level, the proposed architecture processes them as a separate contextual stream.
This structural design distinguishes the proposed SC-TCDL model from prior hybrid architectures that combine recurrent and attention-based modules in two respects. First, historical temporal variables and future contextual variables are explicitly separated according to their temporal roles instead of being merged into a unified input stream. Second, the LSTM and Transformer branches are assigned complementary functional responsibilities: the LSTM branch compresses endogenous temporal dynamics, while the Transformer branch encodes forward-looking contextual signals. This functional separation provides the architectural basis for the skip-connected fusion mechanism described in the following subsection.
3.2 Skip-Connected Fusion and Prediction
Building on the disentangled dual-stream design, the proposed SC-TCDL model introduces a skip-connected fusion mechanism that allows temporal information extracted by the LSTM branch to influence the contextual encoding process within the Transformer branch while preserving the independence of the two input streams.
As illustrated in <Figure 3>, the LSTM branch processes historical temporal inputs and produces the temporal summary vector h_t, whereas the Transformer branch processes forward-looking contextual inputs defined over the forecast horizon.
Let the contextual input sequence be denoted as {x_{t+1}, x_{t+2}, …, x_{t+H}}, where H denotes the prediction horizon and each x_{t+i} represents the contextual variables for forecast step t+i. Each contextual vector is first transformed into a latent contextual representation e_{t+i} ∈ ℝ^{d_c}, where d_c denotes the dimensionality of the contextual representation. The temporal summary vector h_t is then concatenated with each contextual representation to form a skip-conditioned input: z_{t+i} = [h_t; e_{t+i}], i = 1, …, H, where [⋅;⋅] denotes vector concatenation.
This operation injects the compressed historical representation extracted by the LSTM branch directly into the contextual input sequence of the Transformer branch. Positional encoding is subsequently incorporated into the conditioned sequence, and the resulting inputs are processed by the stacked Transformer encoder layers. Through this mechanism, the self-attention module learns contextual relationships among future exogenous variables while being conditioned on historical temporal information.
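The skip-conditioning step itself is a simple stepwise concatenation, sketched here in pure Python with toy dimensions (d_h = 2, d_c = 1, H = 3 are illustrative, not the model's actual sizes):

```python
def skip_conditioned_inputs(h_t, contextual_reprs):
    """Concatenate the temporal summary h_t with each contextual vector."""
    return [h_t + e for e in contextual_reprs]  # list '+' is concatenation

h_t = [0.1, 0.2]             # toy temporal summary vector (d_h = 2)
E = [[1.0], [2.0], [3.0]]    # toy contextual representations over H = 3 steps
Z = skip_conditioned_inputs(h_t, E)
# Each fused vector has dimension d_h + d_c = 3, e.g., Z[0] == [0.1, 0.2, 1.0]
```

Note that the same temporal summary is broadcast to every forecast step, so the self-attention over Z sees identical historical context at each position while the contextual component varies.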
Accordingly, the temporal summary vector plays a dual role in the proposed architecture. First, it conditions the contextual encoding process through skip-connected fusion at the Transformer input stage. Second, it is retained outside the Transformer pathway and reused at the final prediction stage.
After the Transformer encoder processes the conditioned contextual sequence, it produces a contextual representation summarizing future exogenous information. This contextual representation is then combined with the temporal summary vector to form the final integrated representation, which is subsequently fed into an MLP. The MLP serves as a nonlinear prediction layer and generates the final H-step forecast of monthly inbound tourist arrivals.
Compared with conventional hybrid architectures based on raw-input fusion or output-level combination, the proposed SC-TCDL model enables structured interaction between temporal and contextual information through skip-connected conditioning while preserving the representational independence of the two input streams.
4. Experimental Design
This section describes the experimental framework, consisting of three parts: data preprocessing, pandemic-period data restoration, and forecasting configuration and training strategy.
4.1 Data Description and Preprocessing
This study employs a monthly time-series dataset spanning from February 2013 to April 2025, with the primary dependent variable being the number of foreign tourist arrivals to South Korea. Tourist arrival data are obtained from the Korea Tourism Organization's official statistics, while exogenous variables are collected from publicly available sources, including the Bank of Korea and the Korea Meteorological Administration. <Table 1> provides a summary of the definitions, units, sources, branch assignments, and preprocessing rules for all variables included in the SC-TCDL framework.
The LSTM branch is designed to capture temporal dependencies. Its input consists of a 12-month rolling window of lagged tourist arrivals as well as binary indicators representing national holidays and recurring calendar events. Accordingly, the temporal input stream comprises three feature channels: tourist arrivals, a holiday indicator, and a calendar event indicator. These variables reflect repetitive and time-bound demand influences, such as fixed-date public holidays and seasonal cultural festivals, and are therefore naturally modeled through recurrent temporal structures. By assigning these features to the LSTM branch, the model separates cyclical temporal patterns from non-recurring contextual influences, thereby reducing the risk of feature entanglement.
The Transformer branch is designed to model contextual factors that may influence tourism demand beyond historical patterns. The selected contextual variables include the average monthly exchange rate, average monthly temperature, a single monthly binary irregular event indicator, and a single monthly binary indicator for extreme weather conditions. These variables were selected to reflect three broad categories of tourism demand drivers frequently discussed in the literature: economic and price-related conditions, climate-related factors, and irregular external shocks affecting international travel.
More specifically, the irregular event indicator takes the value 1 when at least one major non-recurring external event likely to affect inbound tourism demand occurs in a given month, and 0 otherwise. In the present study, this includes exceptional events such as COVID-19-related border restrictions and mobility disruptions, diplomatic tensions between Korea and Japan, China's THAAD-related retaliatory measures, and safety-related demand shocks following the Itaewon crowd disaster. The extreme weather indicator likewise takes the value 1 when severe weather conditions, such as heavy rainfall, flooding, typhoon-related transport disruption, or heavy snowfall, are considered likely to have materially affected tourism-related mobility or travel conditions during a given month, and 0 otherwise. Taken together, exchange rates represent changes in relative travel cost and purchasing conditions; temperature and the extreme weather indicator capture climate-related influences on travel timing and destination attractiveness; and the irregular event indicator reflects non-recurring shocks associated with geopolitical, public-health, social, or other exceptional circumstances.
In the proposed SC-TCDL framework, structural shocks such as COVID-19 are explicitly represented through the irregular event indicator rather than being implicitly absorbed into the historical demand sequence. This specification allows the model to distinguish recurrent endogenous demand dynamics from non-recurring external disruptions over the forecast horizon. The LSTM branch is therefore intended to model persistence, seasonality, and recurring calendar effects embedded in past tourist arrivals, whereas the Transformer branch processes forward-looking contextual conditions and exceptional external shocks that are not governed by regular temporal recurrence. Accordingly, lagged tourist arrivals and calendar-related variables are assigned to the LSTM branch, while exchange rates, temperature, irregular events, and extreme weather conditions are assigned to the Transformer branch to reflect the functional separation between endogenous demand structure and future exogenous contextual shocks.
Unlike the variables assigned to the LSTM branch, these inputs are modeled as forward-looking contextual factors over the forecast horizon rather than as recurrent temporal sequences. Prior to model training, Transformer inputs are augmented with positional encodings to preserve temporal ordering within the attention mechanism.
Continuous variables are normalized to a [0, 1] range using Min-Max scaling based on the training dataset, while binary variables remain unscaled. Accordingly, the empirical design preserves the distinction between temporal inputs handled by the LSTM branch and contextual inputs handled by the Transformer branch, while applying preprocessing rules appropriate to the scale and type of each variable.
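The training-set-based Min-Max scaling described above can be sketched in pure Python (the toy values are illustrative; in the actual pipeline the statistics come from the February 2013–March 2022 training split, and binary indicators bypass this step entirely):

```python
def minmax_fit(train_values):
    """Compute scaling statistics from the training split only."""
    return min(train_values), max(train_values)

def minmax_transform(values, lo, hi):
    """Scale values to [0, 1] relative to the training range."""
    span = hi - lo if hi > lo else 1.0  # guard against a constant series
    return [(v - lo) / span for v in values]

train = [100.0, 300.0, 200.0]
test = [150.0, 400.0]  # test data may exceed the training range
lo, hi = minmax_fit(train)
scaled_test = minmax_transform(test, lo, hi)  # → [0.25, 1.5]
```

Fitting the scaler only on the training split avoids information leakage; as the example shows, test-period values may then legitimately fall outside [0, 1].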
This study assumes that future contextual variables over the forecast horizon are available through external forecasts or scenario-based inputs. In practical forecasting environments, variables such as exchange rates, temperatures, and event-related conditions may be obtained from external forecasting agencies, scenario-based planning assumptions, or policy projections. Accordingly, the practical contribution of the proposed SC-TCDL framework lies not in assuming perfect foresight of future exogenous conditions, but in providing a conditional forecasting framework that can incorporate externally supplied future scenarios in a structured manner. In this sense, the model is particularly relevant for planning-oriented applications in which tourism authorities or industry practitioners evaluate likely demand trajectories under alternative macroeconomic, climate-related, or event-related assumptions. The practical implications of this framework for tourism-demand planning are discussed further in Section 6.2.
4.2 Restoration of Pandemic-Period Data and Visual Validation of Reconstructed Demand
To correct for structural distortions in tourism demand caused by the COVID-19 pandemic, this study implemented a virtual demand restoration strategy targeting the period from March 2020 to December 2023. This timeframe captures not only the immediate collapse in inbound travel due to international border closures, but also the prolonged suppression, volatility, and irregular seasonality that followed throughout the recovery phase.
Although the formal restoration period is defined as beginning in March 2020, a substantial decline in tourist arrivals had already emerged in February, with inbound visitors decreasing by over 40% compared to the same month of the previous year. This early collapse—triggered by international travel restrictions and heightened public concern—preceded official domestic outbreak declarations and marked the onset of exogenous shocks to the tourism sector.
Throughout the restoration period, actual tourism demand remained well below pre-pandemic levels and failed to exhibit stable seasonal trends. These distortions compromised the temporal consistency required for model training, thereby impairing the learning of meaningful patterns such as trend and seasonality. To mitigate these issues, the entire period from March 2020 through December 2023 was designated as structurally contaminated and replaced with counterfactual estimates generated by an LSTM model using data from the pre-pandemic period (February 2013 to February 2020). This approach ensured that the hybrid forecasting architecture could learn from temporally coherent input sequences, free from pandemic-induced noise, while preserving long-range dependencies.
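The restoration loop can be sketched as an iterative roll-forward in which each counterfactual estimate is appended to the series before the next one is generated, so later predictions condition on restored rather than distorted values. The `predict_next_month` argument is a hypothetical stand-in interface for the LSTM trained on pre-pandemic data; a seasonal-naive function is used here only to make the sketch runnable:

```python
def restore_period(history, n_months, predict_next_month, window=12):
    """Replace a contaminated period with iteratively generated estimates."""
    series = list(history)
    restored = []
    for _ in range(n_months):
        y_hat = predict_next_month(series[-window:])  # forecast from last window
        restored.append(y_hat)
        series.append(y_hat)  # roll forward on the restored value
    return restored

# Toy stand-in model: repeat the value from 12 months earlier (pure seasonality).
seasonal_naive = lambda w: w[0]
hist = list(range(1, 25))  # 24 months of pre-shock history
virtual = restore_period(hist, 3, seasonal_naive)  # → [13, 14, 15]
```

In the study's setting, `history` would end in February 2020 and `n_months` would cover March 2020 through December 2023.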
<Figure 4> presents the full time series of monthly foreign tourist arrivals from February 2013 to December 2023, including both the actual observations and the LSTM-restored values. The restored trajectory visibly smooths over the erratic pandemic-period volatility, reinforcing the structural continuity of the sequence. <Figure 5> focuses exclusively on the pandemic-affected interval (March 2020 to December 2023), providing a direct visual contrast between the original suppressed demand and the reconstructed series. While the actual data shows a prolonged slump and gradual recovery, the restored values maintain a plausible seasonal profile aligned with pre-pandemic norms, validating the necessity and effectiveness of the restoration procedure.
To further assess the plausibility of the reconstructed demand, <Figure 6> presents a month-by-month comparison between the reconstructed tourism demand and the pre-pandemic five-year average. The results show that the restored values generally exceed historical averages while maintaining consistent seasonal patterns. This is interpreted not as arbitrary overestimation but as a reflection of the long-term upward trend captured by the LSTM model. The restored series preserves key seasonal fluctuations while mitigating distortions caused by the pandemic, thereby restoring temporal consistency in the input data. This strategy aligns with the model's architectural principle of separating endogenous temporal patterns from exogenous shocks, ultimately enhancing the reliability and generalizability of forecasting outcomes.
4.3 Forecasting Configuration and Training Strategy
The proposed SC-TCDL model addresses a 12-step direct forecasting task for monthly inbound tourist arrivals. Using the most recent 12-month historical window, the model predicts tourist arrivals for the subsequent 12 months in a single forward pass. Consistent with Sections 3.1 and 3.2, the model is implemented as a dual-branch architecture in which the LSTM branch processes temporal inputs and the Transformer branch processes forward-looking contextual inputs. Training samples are constructed using a rolling-window strategy so that the model can learn from multiple temporal contexts over the full sample period.
For each training instance, the temporal input to the LSTM branch is organized as a tensor of shape (B, 12, 3), where B denotes the batch size, 12 is the input sequence length, and the three feature channels correspond to normalized tourist arrivals, a holiday indicator, and a calendar event indicator. The contextual input to the Transformer branch is organized as a tensor of shape (B, 12, 4), where the 12 time steps correspond to the forecast horizon and the four feature channels represent exchange rate, temperature, irregular event indicator, and extreme weather indicator.
The dataset spans from February 2013 to April 2025 and is chronologically divided into training (February 2013-March 2022), validation (April 2022-April 2024), and test (May 2024-April 2025) sets. Model training is implemented in TensorFlow/Keras using the Adam optimizer with a learning rate of 0.001 and a batch size of 16. The loss function is Mean Squared Error (MSE), and early stopping with a patience of five epochs is applied based on validation loss to reduce overfitting.
The LSTM branch consists of a single LSTM layer with 64 hidden units. Its final hidden state has shape (B, 64) and corresponds to the batch-wise implementation of the temporal summary vector h_t introduced in Section 3.1; accordingly, the temporal summary dimension is d_h = 64. To support stepwise fusion with future contextual inputs, this temporal summary vector is repeated across the 12 forecast steps, yielding a tensor of shape (B, 12, 64). On the contextual side, each four-dimensional future input vector is linearly projected into a 64-dimensional latent representation before entering the Transformer encoder; accordingly, the contextual representation dimension is d_c = 64. Sinusoidal positional encoding of the same dimensionality is then incorporated to preserve temporal ordering across the forecast horizon. Following the skip-connected fusion scheme defined in Section 3.2, the repeated temporal summary tensor and the contextual representation tensor are concatenated stepwise along the feature dimension, corresponding to the batch-wise implementation of z_{t+i} = [h_t; e_{t+i}]. This operation produces a fused sequence tensor of shape (B, 12, 128). This fused tensor is then linearly projected to the Transformer model dimension d_model = 64 to obtain a tensor of shape (B, 12, 64), which is subsequently processed by the encoder blocks.
The Transformer branch contains two encoder blocks. Each block consists of multi-head self-attention with 4 attention heads, a model dimension of 64, a feed-forward dimension of 128, residual connections, layer normalization, and dropout with a rate of 0.20. After the final encoder block, global average pooling is applied over the 12 contextual steps to obtain a contextual representation of shape (B, 64).
At the final prediction stage, the pooled contextual representation is concatenated with the LSTM temporal summary vector, resulting in a combined representation of shape (B, 128). The final predictor is implemented as an MLP with two hidden layers of dimensions 64 and 32, respectively. ReLU activation is used in both hidden layers, and dropout with a rate of 0.20 is applied for regularization. The output layer is a linear dense layer with dimension 12, enabling simultaneous prediction over the full 12-month forecast horizon. Accordingly, the dimensions of the temporal summary vector, contextual representation, combined representation, and final output are 64, 64, 128, and 12, respectively.
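The architecture described above can be sketched in Keras roughly as follows. This is a simplified reconstruction from the stated hyperparameters, not the authors' exact implementation; in particular, the post-norm residual encoder block shown here is one common formulation, and layer names are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

hist = layers.Input(shape=(12, 3), name="past_inputs")   # arrivals, holiday, calendar event
ctx = layers.Input(shape=(12, 4), name="future_inputs")  # fx, temperature, irregular, extreme

# Temporal branch: LSTM summary vector of dimension 64, repeated over the horizon
h = layers.LSTM(64)(hist)                                # (B, 64)
h_rep = layers.RepeatVector(12)(h)                       # (B, 12, 64)

# Contextual branch: linear projection to 64 dims plus sinusoidal positional encoding
c = layers.Dense(64)(ctx)                                # (B, 12, 64)
pos = np.arange(12)[:, None]
i = np.arange(64)[None, :]
angle = pos / np.power(10000.0, (2 * (i // 2)) / 64)
pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle)).astype("float32")
c = layers.Lambda(lambda t: t + pe)(c)                   # broadcast over the batch axis

# Skip-connected fusion: stepwise concat, then project to the model dimension
z = layers.Concatenate(axis=-1)([h_rep, c])              # (B, 12, 128)
z = layers.Dense(64)(z)                                  # (B, 12, 64)

# Two Transformer encoder blocks (4 heads, feed-forward dim 128, dropout 0.2)
for _ in range(2):
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=16)(z, z)
    z = layers.LayerNormalization()(z + layers.Dropout(0.2)(attn))
    ff = layers.Dense(128, activation="relu")(z)
    ff = layers.Dense(64)(ff)
    z = layers.LayerNormalization()(z + layers.Dropout(0.2)(ff))

# Pool, concatenate with the temporal summary, and predict all 12 months at once
pooled = layers.GlobalAveragePooling1D()(z)              # (B, 64)
combined = layers.Concatenate()([pooled, h])             # (B, 128)
x = layers.Dropout(0.2)(layers.Dense(64, activation="relu")(combined))
x = layers.Dropout(0.2)(layers.Dense(32, activation="relu")(x))
out = layers.Dense(12)(x)

model = Model([hist, ctx], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```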
Hyperparameters are tuned through grid search on the validation set. The search space includes LSTM hidden size {32, 64}, Transformer model dimension {32, 64}, number of attention heads {2, 4}, feed-forward dimension {64, 128}, and dropout rate {0.10, 0.20, 0.30}. The final configuration reported in this study, namely LSTM hidden size = 64, model dimension = 64, number of heads = 4, feed-forward dimension = 128, and dropout = 0.20, yielded the best validation performance. The maximum number of training epochs is set to 200, and the model with the lowest validation loss is selected as the final model. This best-performing specification is then applied to the final input window to generate the 12-month forecast.
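The grid above can be enumerated directly; the following sketch shows the size of the search space (training and validation scoring of each candidate are omitted):

```python
import itertools

search_space = {
    "lstm_units": [32, 64],
    "d_model": [32, 64],
    "num_heads": [2, 4],
    "d_ff": [64, 128],
    "dropout": [0.10, 0.20, 0.30],
}

# Cartesian product: 2 x 2 x 2 x 2 x 3 = 48 candidate configurations.
# Each would be trained with early stopping and scored by validation MSE,
# with the lowest-loss configuration retained.
configs = [dict(zip(search_space, values))
           for values in itertools.product(*search_space.values())]
```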
5. Forecasting Results and Analysis
5.1 Comparative Evaluation of Overall and Temporal Forecast Accuracy
To assess the forecasting performance of the proposed SC-TCDL model, a comparison was conducted against three representative benchmark models: (1) SARIMA, a univariate time series model designed to capture both non-seasonal and seasonal autocorrelation patterns; (2) a vanilla LSTM implemented with a single LSTM layer (64 hidden units), followed by a fully connected output layer and trained solely on past tourist arrivals; and (3) a Transformer encoder, consisting of two encoder blocks with multi-head attention (4 heads) and feed-forward layers, trained on both historical and future contextual inputs without explicitly disentangling temporal sequences from exogenous signals. Forecast accuracy was evaluated using three standard metrics, MAE, RMSE, and MAPE, which jointly assess magnitude, variance, and scale-relative prediction accuracy [17, 18].
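The three metrics can be computed as follows (a standard formulation; the MAPE here assumes no zero-valued actuals):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large deviations more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error: scale-relative accuracy (y_true must be nonzero)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)
```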
As summarized in <Table 2>, the benchmark models exhibited varying degrees of predictive accuracy. The SARIMA model produced an MAE of 113,164 people, an RMSE of 125,937 people, and a MAPE of 9.63%. The vanilla LSTM model slightly improved on SARIMA's performance, with an MAE of 109,710 people, an RMSE of 118,425 people, and a MAPE of 9.47%. The Transformer model demonstrated moderate accuracy, with an MAE of 104,640 people, an RMSE of 109,664 people, and a MAPE of 9.21%, offering marginal gains over SARIMA and vanilla LSTM but still falling short of the SC-TCDL model.
In contrast, the proposed SC-TCDL model significantly outperformed all benchmark models across all three evaluation metrics. It achieved the lowest forecasting errors, with an MAE of 78,626 people, an RMSE of 94,019 people, and a MAPE of 6.94%, thereby reducing MAE by 30.5% relative to SARIMA, 28.3% relative to the vanilla LSTM, and 24.9% relative to the Transformer; consistent reductions are also observed for RMSE and MAPE. These results demonstrate the model's enhanced capacity to jointly capture long-term temporal dependencies and contextual influences on tourism demand, leading to more precise and robust forecasts.
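The reported MAE reductions follow directly from the Table 2 values:

```python
# MAE values in people, taken from Table 2.
sarima, lstm, transformer, sc_tcdl = 113_164, 109_710, 104_640, 78_626

def pct_reduction(baseline, model=sc_tcdl):
    """Percentage MAE reduction of SC-TCDL relative to a baseline model."""
    return round(100 * (baseline - model) / baseline, 1)
```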
As a robustness check, the forecasting models were also compared using the original tourist arrival series without restoring the COVID-19 demand collapse. Under this more volatile setting, the absolute forecasting errors increased for all models due to the extreme demand disruption during the pandemic period. Nevertheless, the relative ranking of model performance remained broadly consistent with the main results reported in <Table 2>. In particular, the proposed SC-TCDL model still yielded the lowest MAE (96,547 people), followed by the Transformer model (118,203 people), LSTM (124,876 people), and SARIMA (137,526 people). This observation suggests that the forecasting advantage of SC-TCDL does not solely depend on the restoration procedure but also reflects its ability to incorporate contextual information and structural disruptions affecting tourism demand.
To evaluate the month-level forecasting performance of each model in greater detail, monthly MAE trends from May 2024 to April 2025 are presented in <Figure 7>. The SC-TCDL model demonstrated relatively low MAE values across most months of the forecast horizon, exhibiting a generally smooth and stable error trajectory. This indicates the model's structural strength in effectively capturing not only recurring seasonal patterns but also month-specific variations in demand.
On the other hand, the SARIMA model exhibited pronounced error spikes in October 2024, March 2025, and April 2025, indicating substantial instability in its forecasting accuracy during these months. The vanilla LSTM model also showed notable error surges in October 2024 and consistently elevated errors from February through April 2025. These patterns reflect the model's limited robustness across fluctuating temporal patterns. The Transformer model demonstrated intermediate performance with less fluctuation than SARIMA and vanilla LSTM.
The Wilcoxon signed-rank test [27] was applied to monthly MAE values to assess the statistical significance of the SC-TCDL model's improvements. The results confirmed that the SC-TCDL model significantly outperformed SARIMA (p = 0.004), vanilla LSTM (p = 0.016), and Transformer (p = 0.042), reinforcing that its performance gains are both consistent and statistically robust.
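The test pairs month-matched MAE values across the 12-month horizon; a minimal sketch with placeholder error series is shown below (the actual monthly MAEs are those underlying Figure 7):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical monthly MAE values (in thousands of people) for two models,
# paired by month over a 12-month test horizon.
mae_model_a = np.array([70, 65, 80, 75, 90, 60, 85, 78, 72, 88, 66, 81], dtype=float)
mae_model_b = np.array([95, 88, 110, 102, 120, 90, 115, 105, 98, 118, 92, 108], dtype=float)

# Paired, non-parametric test on the month-matched differences; a small
# p-value indicates model A's monthly errors are systematically lower.
stat, p_value = wilcoxon(mae_model_a, mae_model_b)
```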
In summary, the SC-TCDL model achieved the highest overall accuracy and maintained relatively stable forecasting performance across the 12-month horizon. Its generally smooth error trajectory suggests enhanced robustness and adaptability to varying temporal patterns.
5.2 Component-wise Contribution Analysis: LSTM and Transformer Modules
To evaluate the respective contributions of the temporal and contextual branches within the SC-TCDL architecture, an ablation study was conducted using three model configurations: the full SC-TCDL model, an LSTM-only model, and a Transformer-only model. <Table 3> reports the forecasting errors for each configuration using MAE, RMSE, and MAPE over the test period.
The results show that both the LSTM and Transformer branches contribute meaningfully to the overall performance of the hybrid model, with the combined SC-TCDL configuration achieving the highest accuracy (MAE = 78,626; RMSE = 94,019; MAPE = 6.94%). The LSTM-only model, which leverages autoregressive temporal signals such as seasonality and holiday effects, outperforms the Transformer-only variant (MAE = 108,807 vs. 127,664; RMSE = 119,533 vs. 138,697; MAPE = 9.34% vs. 11.13%), reflecting the strong predictive power of historical demand dynamics. Nevertheless, the Transformer-only model, trained exclusively on forward-looking exogenous variables such as exchange rate, temperature, and irregular events, produces informative forecasts, highlighting the importance of contextual signals in capturing future fluctuations in tourist demand.
<Figure 8> presents the monthly MAE trends for the SC-TCDL, LSTM-only, and Transformer-only models across the 12-month forecast horizon. The SC-TCDL model consistently outperforms the ablated variants in most months, maintaining the lowest error levels throughout the forecast period, with the exception of a temporary increase in February 2025. The LSTM-only model exhibits relatively stable error patterns but persistently underperforms compared to SC-TCDL. In contrast, the Transformer-only model displays substantial fluctuations, with pronounced error spikes in January and April 2025, highlighting its vulnerability to temporal variation when lacking autoregressive input structure.
These results confirm the synergistic effect of integrating temporal and contextual streams through the SC-TCDL architecture. Compared to the LSTM-only baseline, the SC-TCDL model improves MAE by 27.7%, RMSE by 21.3%, and MAPE by 25.7%. Against the Transformer-only model, it achieves improvements of 38.4% in MAE, 32.2% in RMSE, and 37.6% in MAPE.
To determine whether the observed performance differences were statistically significant, Wilcoxon signed-rank tests were performed on the monthly MAE scores. The SC-TCDL model exhibited significantly better forecasting accuracy than both the LSTM-only model (p = 0.003) and the Transformer-only model (p < 0.001). These results provide strong statistical evidence that the enhanced performance of the SC-TCDL model stems from its architectural integration of temporal and contextual information, rather than random variation.
5.3 Contextual Variable Contribution Analysis
To further interpret how contextual variables influence forecasting performance, a permutation-based contribution analysis was conducted on the trained SC-TCDL model. While the ablation study in Section 5.2 confirmed that the contextual branch improves predictive accuracy, it does not reveal the relative importance of individual contextual variables. The present analysis therefore evaluates how each exogenous variable contributes to predictive performance within the Transformer branch.
Specifically, for each contextual variable, namely exchange rate, temperature, irregular event indicator, and extreme weather indicator, the observed sequence over the test horizon was randomly permuted while all other inputs were kept unchanged. The trained SC-TCDL model was then used to generate forecasts under the perturbed input configuration. If a variable contains meaningful predictive information, disrupting its temporal alignment should degrade forecast accuracy. The resulting increase in forecast error relative to the unpermuted SC-TCDL model therefore provides a quantitative measure of its predictive contribution.
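A sketch of this permutation procedure, assuming a generic `predict` function and a contextual input array of shape (N, 12, 4); the function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_contribution(predict, X_ctx, y_true, channel_names):
    """MAE increase when each contextual channel is shuffled over the forecast horizon.

    X_ctx: contextual inputs of shape (N, 12, C); y_true: targets of shape (N,).
    A larger MAE increase indicates a more informative channel."""
    base_mae = np.mean(np.abs(y_true - predict(X_ctx)))
    scores = {}
    for j, name in enumerate(channel_names):
        X_perm = X_ctx.copy()
        # Break the temporal alignment of channel j only, within each instance.
        X_perm[:, :, j] = rng.permuted(X_perm[:, :, j], axis=1)
        scores[name] = np.mean(np.abs(y_true - predict(X_perm))) - base_mae
    return scores
```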
<Table 4> reports the results of the permutation-based contribution analysis. When the exchange-rate sequence is permuted, the MAE increases by 14,020 people relative to the unpermuted SC-TCDL model, indicating that exchange-rate fluctuations provide the most influential contextual signal in the present study. Temperature shows the second-largest MAE increase of 8,580 people, suggesting that climate conditions also play an important role in explaining seasonal tourism demand patterns.
The irregular event indicator produces an MAE increase of 6,560 people, indicating that episodic external shocks, such as international events or geopolitical disruptions, provide additional predictive information beyond regular seasonal patterns. Finally, permuting the extreme weather indicator increases the MAE by 3,970 people, suggesting that abnormal climate conditions affect short-term tourism fluctuations, although their overall contribution is smaller than that of macroeconomic and broader climate-related drivers.
Relative contribution percentages were computed by dividing each variable's permutation-induced MAE increase by the sum of MAE increases across all contextual variables. On this basis, exchange rate accounts for 42.3% of the total contextual contribution, followed by temperature at 25.9%, irregular event indicator at 19.8%, and extreme weather indicator at 12.0%. These results suggest that the predictive gains obtained by the SC-TCDL architecture are driven primarily by macroeconomic and climate-related contextual information.
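The relative contribution percentages follow directly from the Table 4 MAE increases:

```python
# Permutation-induced MAE increases (people), taken from Table 4.
increases = {
    "exchange_rate": 14_020,
    "temperature": 8_580,
    "irregular_event": 6_560,
    "extreme_weather": 3_970,
}
total = sum(increases.values())  # 33,130
shares = {k: round(100 * v / total, 1) for k, v in increases.items()}
```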
<Figure 9> visualizes contextual variable contributions measured by permutation-induced MAE increases. From a practical forecasting perspective, these findings suggest that tourism demand forecasting systems may benefit particularly from monitoring exchange-rate movements and climate conditions, while irregular external shocks and extreme weather events provide complementary contextual information. It should be noted, however, that these contribution scores represent model-conditional predictive importance rather than causal effects. Because contextual variables may interact within the Transformer encoder, the reported values should be interpreted as relative contributions within the trained SC-TCDL model rather than as isolated structural impacts.
5.4 Forecast Visualization and Uncertainty Analysis
<Figure 10> presents a comparison between the actual monthly foreign tourist arrivals and the forecasts generated by the proposed SC-TCDL model over a 12-month horizon. The predicted line closely follows the actual values, suggesting that the model effectively captures both seasonal variations and underlying shifts in tourism demand.
The shaded region represents the 95% prediction interval, estimated using Monte Carlo Dropout inference with 100 stochastic forward passes. This interval reflects the model's internal uncertainty, capturing the prediction variance induced by dropout during inference [5]. Throughout the forecast horizon, the prediction intervals remained relatively stable without substantial widening, indicating a high degree of predictive reliability. This consistent uncertainty pattern suggests that the model maintains a well-calibrated forecast structure capable of delivering trustworthy estimates across varying demand conditions.
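Intervals of this kind can be computed by repeating stochastic forward passes; a sketch, assuming a `stochastic_predict` callable that keeps dropout active at inference time (in Keras, `model(X, training=True)`):

```python
import numpy as np

def mc_dropout_interval(stochastic_predict, X, n_passes=100, alpha=0.05):
    """Mean forecast and (1 - alpha) prediction interval via Monte Carlo Dropout.

    Each call to stochastic_predict must apply a fresh dropout mask, so the
    spread across passes reflects the model's internal uncertainty."""
    draws = np.stack([stochastic_predict(X) for _ in range(n_passes)])  # (n_passes, N, 12)
    mean = draws.mean(axis=0)
    lower = np.percentile(draws, 100 * alpha / 2, axis=0)
    upper = np.percentile(draws, 100 * (1 - alpha / 2), axis=0)
    return mean, lower, upper
```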
From a policy and planning perspective, such stability is particularly valuable. For instance, when prediction intervals narrow during peak travel seasons, tourism authorities can make more confident decisions regarding marketing campaigns, accommodation planning, and foreign visitor service allocations. Conversely, wider intervals may prompt more cautious strategies during periods of increased volatility.
The stability and alignment of the predicted values and prediction intervals reflect the complementary nature of the SC-TCDL model's hybrid architecture. By disentangling and separately modeling temporal dependencies and contextual influences, the model not only improves point prediction accuracy but also contributes to reliable uncertainty quantification. These characteristics suggest that the model holds strong potential for forecasting in tourism environments characterized by demand variability and policy sensitivity.
6. Conclusion
6.1 Summary of Contributions
This study proposes the SC-TCDL model, a novel hybrid deep learning architecture designed for forecasting monthly foreign tourist arrivals in South Korea. By structurally disentangling historical temporal signals and future contextual information into independent LSTM and Transformer branches, respectively, the model provides a functionally specialized framework that reflects the distinct nature of tourism demand drivers.
The temporal branch, powered by LSTM, captures sequential regularities such as seasonality and holiday effects, while the contextual branch, implemented with a Transformer encoder, models forward-looking exogenous factors such as exchange rate, temperature, irregular events, and extreme weather. The temporal summary produced by the LSTM is injected into the Transformer via skip connections. The enriched contextual representations generated by the Transformer are then concatenated with the LSTM's temporal summary and passed through the MLP, enabling accurate 12-step forecasts while preserving the independence of temporal and contextual features.
The entire set of experiments was conducted using real-world monthly tourism data spanning February 2013 to April 2025. During the designated test period (May 2024–April 2025), the SC-TCDL model consistently outperforms both the traditional statistical model (SARIMA) and deep learning baselines (vanilla LSTM, Transformer). Forecasting performance improvements are validated using MAE, RMSE, and MAPE metrics, with Wilcoxon signed-rank tests confirming the statistical significance of the observed gains. Furthermore, the results of an ablation study highlight the individual predictive capabilities of the LSTM and Transformer branches. While each module contributes meaningfully on its own, their integration within the SC-TCDL architecture yields notably improved accuracy. In addition, a permutation-based contribution analysis of the contextual branch shows that exchange rate and temperature are the most influential exogenous variables, while irregular events and extreme weather provide complementary predictive information within the trained SC-TCDL model.
To mitigate the structural disruptions induced by the COVID-19 pandemic, this study implements a virtual demand restoration approach based on an LSTM model trained on pre-pandemic data. The reconstructed values are visually assessed against historical seasonal patterns to verify their temporal alignment and plausibility.
Moreover, the model incorporates Monte Carlo Dropout for uncertainty quantification. This enables not only robust point predictions but also well-calibrated prediction intervals that reflect the model's internal uncertainty across varying demand conditions. These probabilistic forecasts provide decision-makers with actionable insights for contingency planning and strategic resource allocation.
In conclusion, the SC-TCDL model advances hybrid time-series forecasting by providing a modular forecasting framework that demonstrates promising empirical performance in tourism demand prediction. The architectural principles and empirical validations presented in this study offer broad applicability for multivariate forecasting tasks in domains affected by both recurring and disruptive factors.
6.2 Limitations and Future Directions
Although the SC-TCDL model has shown promising results, certain limitations remain and should be acknowledged as opportunities for future investigation. First, the present study focuses exclusively on national-level aggregate forecasts of inbound tourism demand, without accounting for heterogeneity across traveler segments or destination regions. Given the potential variability in demand patterns by country of origin, travel purpose, or regional attractions within South Korea, future research should explore the incorporation of spatial or demographic disaggregation to improve the model's granularity and policy relevance.
Second, the present study assumes that forward-looking contextual variables are available through external forecasts or scenario-based planning inputs during the forecasting horizon. Although this assumption is consistent with the conditional forecasting framework adopted in the present study and supports planning-oriented applications, it may limit the direct applicability of the model in fully real-time forecasting settings where such information is unavailable or highly uncertain. Future research could extend the proposed framework by jointly forecasting key contextual variables or by incorporating probabilistic scenario generation techniques and uncertainty-aware architectures to better reflect real-world forecasting conditions and enhance the model's robustness.