In the year 2912, our technological advancements have taken us far beyond the confines of Earth, leading us to explore and colonize distant exoplanets. However, with great exploration come new challenges, and today we face a cosmic mystery that demands our best data science skills.
The Incident
A month ago, the Spaceship Titanic, an interstellar passenger liner, embarked on its maiden voyage. Its mission was to transport nearly 13,000 emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. The journey took an unexpected turn when, while rounding Alpha Centauri, the spaceship collided with a spacetime anomaly hidden within a dust cloud. This collision, eerily reminiscent of its Earth-bound namesake’s fate, resulted in nearly half of its passengers being transported to an alternate dimension.
The Challenge
The Spaceship Titanic’s tragedy has led to an unprecedented challenge: predicting which passengers were transported by the anomaly. To assist in this mission, we’ve retrieved personal records from the ship’s damaged computer system, providing us with crucial data to analyze.
The Approach
To tackle this challenge, I employed a variety of data science techniques. The process began with the essential task of data preprocessing, where I handled missing values and encoded categorical variables. I used Python, along with libraries like pandas and scikit-learn, to clean and prepare the data for analysis.
Next, I chose a RandomForestClassifier for its robustness on tabular data with mixed feature types, once the categorical variables are encoded. Through a carefully constructed pipeline, I trained the model on the available passenger data, letting it learn patterns that predict the likelihood of a passenger being transported to the alternate dimension.
Predicting the Unseen
The final step was to apply this model to unseen test data, predicting the fate of each passenger. By feeding the test data through the same preprocessing steps and then using my trained model, I generated predictions that could potentially aid rescue operations.
Creating a Path for Rescue
The outcomes were compiled into a submission file, ready to be shared with the rescue crews. This file, a beacon of hope, could significantly increase the efficiency and success rate of the rescue missions, potentially saving thousands of lives.
Conclusion
This project is more than a showcase of data science skills; it’s a testament to how technology can be harnessed to solve not just earthly challenges but cosmic mysteries too. As I continue to reach for the stars, the interplay between technology and human ingenuity will remain my greatest asset in navigating the unknown frontiers of space.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/spaceship-titanic/sample_submission.csv
/kaggle/input/spaceship-titanic/train.csv
/kaggle/input/spaceship-titanic/test.csv
# Load the Spaceship Titanic training dataset from a CSV file and display the first 5 rows.
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
train_data.head()
| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
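Several columns contain missing values, which is why imputation comes up later. As a quick check of my own (not part of the original notebook), counting the missing values per column makes the scale of the problem visible:

# Optional sanity check (my addition): count missing values per column
# in the training data before deciding on an imputation strategy.
print(train_data.isnull().sum())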
# Load the Spaceship Titanic test dataset from a CSV file and display the first 5 rows.
test_data = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
test_data.head()
| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nelly Carsoning |
| 1 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | Lerome Peckers |
| 2 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Sabih Unhearfus |
| 3 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | Meratz Caltilter |
| 4 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | Brence Harperez |
# Importing necessary libraries for data manipulation and machine learning
from sklearn.impute import SimpleImputer # for handling missing values in numerical data
from sklearn.preprocessing import OneHotEncoder # for converting categorical variables into a form that could be provided to ML algorithms
from sklearn.compose import ColumnTransformer # for performing different transformations on different columns
# Define the features and target variable for the model
features = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'] # list of features used for prediction
target = 'Transported' # the target variable we want to predict
# Handling missing values and encoding categorical variables
numerical_features = train_data[features].select_dtypes(include=['int64', 'float64']).columns # extracting names of numerical columns
categorical_features = train_data[features].select_dtypes(include=['object', 'bool']).columns # extracting names of categorical columns
# Creating transformers for preprocessing the data
numerical_transformer = SimpleImputer(strategy='mean') # transformer for filling missing values in numerical features with the mean
categorical_transformer = OneHotEncoder(handle_unknown='ignore') # transformer for encoding categorical features into one-hot vectors
# Combining transformers into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),  # apply numerical_transformer to numerical features
        ('cat', categorical_transformer, categorical_features)  # apply categorical_transformer to categorical features
    ]
)
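Before wiring the preprocessor into a pipeline, it can be reassuring to see what it actually produces. A minimal sketch of my own (not in the original notebook) fits and transforms the training features and inspects the resulting shape; one-hot encoding expands the column count beyond the original ten features:

# Illustrative check (my addition): fit-transform the training features
# and inspect the output shape. The pipeline below refits everything anyway,
# so this step is purely for inspection.
sample_transformed = preprocessor.fit_transform(train_data[features])
print(sample_transformed.shape)  # (rows, numeric columns + one-hot columns)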
# Importing necessary libraries for building a machine learning model
from sklearn.ensemble import RandomForestClassifier # RandomForestClassifier for classification tasks
from sklearn.pipeline import Pipeline # Pipeline to chain together preprocessing and model training
# Define the model: a RandomForestClassifier with 100 trees (n_estimators)
# and a fixed random state for reproducibility
model = RandomForestClassifier(n_estimators=100, random_state=0)
# Creating the preprocessing and training pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # First step: preprocess the data using the preprocessor defined above
    ('model', model)                 # Second step: train the RandomForestClassifier model
])
# Splitting the data into features and target
X_train = train_data[features] # Extracting the features for training from the train_data
y_train = train_data[target] # Extracting the target variable (what we want to predict) from the train_data
# Training the model
# Fitting the pipeline to the training data: this preprocesses the data and then trains the model
pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', SimpleImputer(),
                                                  Index(['Age', 'RoomService', 'FoodCourt',
                                                         'ShoppingMall', 'Spa', 'VRDeck'],
                                                        dtype='object')),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  Index(['HomePlanet', 'CryoSleep',
                                                         'Destination', 'VIP'],
                                                        dtype='object'))])),
                ('model', RandomForestClassifier(random_state=0))])
# Preprocessing the test data and making predictions
X_test = test_data[features]
# Extracting the features from the test dataset. These are the same features used for training the model.
test_predictions = pipeline.predict(X_test)
# Making predictions on the test dataset using the trained pipeline.
# The pipeline automatically applies the same preprocessing (like handling missing values and encoding categorical variables)
# before feeding the data into the RandomForestClassifier model to make predictions.
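As a lightweight sanity check on the predictions (my own addition, not part of the competition workflow), the predicted class balance can be compared against the roughly even split of Transported in the training labels:

# Illustrative check: the predicted Transported rate should be plausible,
# since roughly half of the training passengers were transported.
print(pd.Series(test_predictions).value_counts(normalize=True))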
# Creating the submission file
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Transported': test_predictions})
# Creating a DataFrame 'output' with two columns: 'PassengerId' and 'Transported'.
# 'PassengerId' is taken from the test dataset and 'Transported' contains the predictions from the model.
output.to_csv('submission_spaceship_titanic.csv', index=False)
# Saving the DataFrame to a CSV file named 'submission_spaceship_titanic.csv'.
# The 'index=False' parameter is used to indicate that the index of the DataFrame should not be written to the file.
print("Your submission was successfully saved!")
# Printing a message to the console to confirm that the submission file has been successfully created and saved.
Your submission was successfully saved!