In the year 2912, our technological advancements have taken us far beyond the confines of Earth, leading us to explore and colonize distant exoplanets. However, with great exploration come new challenges, and today we face a cosmic mystery that demands our best data science skills.
The Incident
A month ago, the Spaceship Titanic, an interstellar passenger liner, embarked on its maiden voyage. Its mission was to transport nearly 13,000 emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. The journey took an unexpected turn when, while rounding Alpha Centauri, the spaceship collided with a spacetime anomaly hidden within a dust cloud. This collision, eerily reminiscent of its Earth-bound namesake’s fate, resulted in nearly half of its passengers being transported to an alternate dimension.
The Challenge
The Spaceship Titanic’s tragedy has led to an unprecedented challenge: predicting which passengers were transported by the anomaly. To assist in this mission, we’ve retrieved personal records from the ship’s damaged computer system, providing us with crucial data to analyze.
The Approach
To tackle this challenge, I employed a variety of data science techniques. The process began with the essential task of data preprocessing, where I handled missing values and encoded categorical variables. I used Python, along with libraries like pandas and scikit-learn, to clean and prepare the data for analysis.
Next, I chose a RandomForestClassifier for its robustness on tabular data with mixed feature types, once the categorical variables are encoded. Through a carefully constructed pipeline, I trained the model on the available passenger data, letting it learn patterns that predict the likelihood of a passenger being transported to the alternate dimension.
Predicting the Unseen
The final step was to apply this model to unseen test data, predicting the fate of each passenger. By feeding the test data through the same preprocessing steps and then using my trained model, I generated predictions that could potentially aid rescue operations.
Creating a Path for Rescue
The outcomes were compiled into a submission file, ready to be shared with the rescue crews. This file, a beacon of hope, could significantly increase the efficiency and success rate of the rescue missions, potentially saving thousands of lives.
Conclusion
This project is more than a showcase of data science skills; it’s a testament to how technology can be harnessed to solve not just earthly challenges but cosmic mysteries too. As I continue to reach for the stars, the interplay between technology and human ingenuity will remain my greatest asset in navigating the unknown frontiers of space.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/spaceship-titanic/sample_submission.csv
/kaggle/input/spaceship-titanic/train.csv
/kaggle/input/spaceship-titanic/test.csv
# Load the Spaceship Titanic training dataset from a CSV file and display the first 5 rows.
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
train_data.head()
| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
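Several columns contain missing values, which is why imputation comes up later. As a quick check of my own (not part of the original notebook), counting the missing values per column makes the scale of the problem visible:

# Optional sanity check (my addition): count missing values per column
# in the training data before deciding on an imputation strategy.
print(train_data.isnull().sum())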
# Load the Spaceship Titanic test dataset from a CSV file and display the first 5 rows.
test_data = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
test_data.head()
| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nelly Carsoning |
| 1 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | Lerome Peckers |
| 2 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Sabih Unhearfus |
| 3 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | Meratz Caltilter |
| 4 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | Brence Harperez |
# Importing necessary libraries for data manipulation and machine learning
from sklearn.impute import SimpleImputer # for handling missing values in numerical data
from sklearn.preprocessing import OneHotEncoder # for converting categorical variables into a form that could be provided to ML algorithms
from sklearn.compose import ColumnTransformer # for performing different transformations on different columns
# Define the features and target variable for the model
features = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'] # list of features used for prediction
target = 'Transported' # the target variable we want to predict
# Handling missing values and encoding categorical variables
numerical_features = train_data[features].select_dtypes(include=['int64', 'float64']).columns # extracting names of numerical columns
categorical_features = train_data[features].select_dtypes(include=['object', 'bool']).columns # extracting names of categorical columns
# Creating transformers for preprocessing the data
numerical_transformer = SimpleImputer(strategy='mean') # transformer for filling missing values in numerical features with the mean
categorical_transformer = OneHotEncoder(handle_unknown='ignore') # transformer for encoding categorical features into one-hot vectors
# Combining transformers into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),  # apply numerical_transformer to numerical features
        ('cat', categorical_transformer, categorical_features)  # apply categorical_transformer to categorical features
    ]
)
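Before wiring the preprocessor into a pipeline, it can be reassuring to see what it actually produces. A minimal sketch of my own (not in the original notebook) fits and transforms the training features and inspects the resulting shape; one-hot encoding expands the column count beyond the original ten features:

# Illustrative check (my addition): fit-transform the training features
# and inspect the output shape. The pipeline below refits everything anyway,
# so this step is purely for inspection.
sample_transformed = preprocessor.fit_transform(train_data[features])
print(sample_transformed.shape)  # (rows, numeric columns + one-hot columns)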
# Importing necessary libraries for building a machine learning model
from sklearn.ensemble import RandomForestClassifier # RandomForestClassifier for classification tasks
from sklearn.pipeline import Pipeline # Pipeline to chain together preprocessing and model training
# Define the model: a RandomForestClassifier with 100 trees (n_estimators)
# and a fixed random state for reproducibility
model = RandomForestClassifier(n_estimators=100, random_state=0)
# Creating the preprocessing and training pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # First step: preprocess the data using the preprocessor defined above
    ('model', model)                 # Second step: train the RandomForestClassifier model
])
# Splitting the data into features and target
X_train = train_data[features] # Extracting the features for training from the train_data
y_train = train_data[target] # Extracting the target variable (what we want to predict) from the train_data
# Training the model
# Fitting the pipeline to the training data: this preprocesses the data and then trains the model
pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', SimpleImputer(),
                                                  Index(['Age', 'RoomService', 'FoodCourt',
                                                         'ShoppingMall', 'Spa', 'VRDeck'],
                                                        dtype='object')),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  Index(['HomePlanet', 'CryoSleep',
                                                         'Destination', 'VIP'],
                                                        dtype='object'))])),
                ('model', RandomForestClassifier(random_state=0))])
# Preprocessing the test data and making predictions
X_test = test_data[features]
# Extracting the features from the test dataset. These are the same features used for training the model.
test_predictions = pipeline.predict(X_test)
# Making predictions on the test dataset using the trained pipeline.
# The pipeline automatically applies the same preprocessing (like handling missing values and encoding categorical variables)
# before feeding the data into the RandomForestClassifier model to make predictions.
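As a lightweight sanity check on the predictions (my own addition, not part of the competition workflow), the predicted class balance can be compared against the roughly even split of Transported in the training labels:

# Illustrative check: the predicted Transported rate should be plausible,
# since roughly half of the training passengers were transported.
print(pd.Series(test_predictions).value_counts(normalize=True))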
# Creating the submission file
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Transported': test_predictions})
# Creating a DataFrame 'output' with two columns: 'PassengerId' and 'Transported'.
# 'PassengerId' is taken from the test dataset and 'Transported' contains the predictions from the model.
output.to_csv('submission_spaceship_titanic.csv', index=False)
# Saving the DataFrame to a CSV file named 'submission_spaceship_titanic.csv'.
# The 'index=False' parameter is used to indicate that the index of the DataFrame should not be written to the file.
print("Your submission was successfully saved!")
# Printing a message to the console to confirm that the submission file has been successfully created and saved.
Your submission was successfully saved!