Department of Computing Technologies

COS60008 Introduction to Data Science

Semester 1 2024 – Assignment 1

Due: 23:59, Friday 15 March 2024

Introduction

This is an individual assignment and is worth 15% of your final grade. It is intended to evaluate your understanding of, and practical skills in, the first few steps of a typical data science process.

In this assignment, you are provided with three data files, i.e., “data1.csv”, “data2.csv” and “data3.csv”, which form a dataset from a higher education institution about students enrolled in different undergraduate degrees. The files “data1.csv” and “data2.csv” contain the same set of students but distinct sets of attributes describing each student, where each student has a unique ID. The file “data3.csv” contains a different set of students, with each student described by all attributes from both “data1.csv” and “data2.csv”.

You are asked to carry out data acquisition, preparation and exploration steps based on the three data sources according to the given instructions. For example, you need to develop and implement appropriate steps to load and merge the data from the three data files, perform data cleaning, make explorative data analysis, and report your findings.

A discussion forum for the assignment will be available in Canvas. If required, further announcements about the assignment will be posted in Canvas. You are responsible for checking Canvas on a regular basis to stay informed with regards to any updates about the assignment.

Academic Integrity

The submitted assignment must be your own work, and any parts that you did not create yourself must be properly referenced. Plagiarism is treated very seriously at Swinburne. It includes submitting code and/or text copied from other students, the Internet or other resources without proper reference. Allowing others to copy your work is also plagiarism. Please note that you should always create your own assignment even if your ideas are very similar to those of other students.

Plagiarism detection software will be used to check your submissions. Severe penalties (e.g., zero mark) will be applied in cases of plagiarism. For further information, please refer to the relevant section in the Unit Outline under the menu “Syllabus” in Canvas and the Academic Integrity information at

General Requirements

This section contains the general requirements which must be met by your submitted assignment. Marks will be deducted if you fail to meet any of the following general requirements.

• Use Python 3 & Notebook: You must complete Tasks 1 and 2 using the Jupyter Notebook format with a Python 3 kernel.

• Use a single notebook file: All code for Tasks 1 and 2 must be written inside a SINGLE notebook file (assignment1.ipynb).

• Include the header section of markdown: At the start of the notebook file, include and complete the following header section as a cell (for correct Semester). Remove the ( ) text by replacing them with your details.

• Include Task Headings: Before each task, include an appropriate Markdown cell with the task label as a level 2 (##) heading. For example: ## Task 1 – Data Acquisition & Preparation

• Use cells for sub-tasks: Create appropriate cells for sub-tasks within Tasks 1 and 2.

• Don’t have a single cell with too much code that combines different sub-tasks.

• Don’t have a single cell for every single line of python code.

• It is your job to communicate effectively.

• Code Comments: You must include code-level comments in your assignment1.ipynb file to explain the key parts of your code.

• If you do not have code comments that support your code answer, your mark will be reduced even if the code is correct. (Note that this is for KEY parts of your work, not every part of it.)

• It is valuable to make your code comments unique so that your work is not like other students when assessed. Put things in your own words!

• You do NOT have to explain every single line of code or things that are very easy for another programmer to understand.

• Graphs are Clear and Labelled: All your plots should have appropriate titles and axis labels. They need to be presented clearly so that they can be easily understood.

• Follow Task Instructions: You must follow the instructions exactly as given in each task and complete them.

• Create a Report: You must create the report for Task 3 exactly as instructed.

• Submit the report as a PDF file named “assignment1.pdf”.

• You must include the headings and details as specified in the Task 3 instructions.

• Submit Correctly: You must follow the details specified in the “Submission Requirements” section to make your final submission.

Task 1 – Data Acquisition & Preparation (30%)

Firstly, you need to acquire three data files “data1.csv”, “data2.csv”, and “data3.csv”, which are included in a single .zip file named “assignment1_data.zip”, under the menu “Assignments” > “Assignment 1” in Canvas. Put these files into your working folder for the assignment in Jupyter Lab ready to use.

These data files are adapted from the “Student Drop out and Academic Success” data set in the UCI repository2, and contain many records of students with each record corresponding to a specification of a student in terms of various attributes.

The files “data1.csv” and “data2.csv” contain the same set of students but two distinct sets of attributes for describing a student. In contrast, the file “data3.csv” contains a different set of students, where each record of the student consists of all attributes from both “data1.csv” and “data2.csv”.
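The relationship between the three files suggests two different combining operations: a column-wise join for files that share students, and a row-wise stack for files that share attributes. The following is a minimal sketch on made-up data, not a model answer. The column names ("ID", "Gender", "Admission grade") and all values are assumptions for illustration, and the in-memory CSV strings stand in for the real files, which would be read with a call such as pd.read_csv("data1.csv").

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the three files (hypothetical columns and values).
csv1 = "ID,Gender\n1,0\n2,1\n3,1\n"
csv2 = "ID,Admission grade\n1,120.5\n2,133.0\n3,141.2\n"
csv3 = "ID,Gender,Admission grade\n4,1,110.0\n5,0,150.3\n"

data1 = pd.read_csv(io.StringIO(csv1))
data2 = pd.read_csv(io.StringIO(csv2))
data3 = pd.read_csv(io.StringIO(csv3))

# One simple load check: one DataFrame row per non-header line of the raw text.
rows_in_raw = len(csv1.strip().splitlines()) - 1
loaded_ok = data1.shape == (rows_in_raw, 2)

# Same students, different attributes: join column-wise on the shared key.
# validate="one_to_one" raises an error if an ID is repeated on either side.
same_students = data1.merge(data2, on="ID", how="inner", validate="one_to_one")

# Different students, same attributes: stack row-wise.
merged = pd.concat([same_students, data3], ignore_index=True)

print(merged)
```

Whether an inner or an outer join is appropriate depends on whether every ID really appears in both files, which is worth checking on the actual data before choosing.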

The set of 38 possible attributes for a student record and their corresponding value ranges is shown in Table 2 in the Appendix section of this document.

As a data scientist, you have been asked to analyse the data from the three data files. However, before doing that you know that you need to carry out some data preparation operations, e.g., merging and cleaning the data.

In this task, you are asked to utilise the Python package “Pandas” to do the following steps:

1.1. Load the data from the three data files into three Pandas DataFrame entities and check whether each loaded data set is equivalent to the data contained in the raw data file.

1.2. Merge the three data frames into a single one that contains all students, where each student has a unique ID and is described by all the 38 attributes listed (see Table 2).

1.3. Clean the data by using the knowledge you have learned.

• You need to deal with the issues existing in the data, e.g., missing values, duplicates, impossible values and extra whitespace. However, you must NOT modify any parts of the data that do not suffer from issues. Failing to do so will lead to a mark reduction.

• When dealing with missing values (if any), you can remove an entire row or column ONLY IF more than 50% of its elements are missing. Otherwise, you must find other appropriate cleaning methods to handle missing values.

• You must be able to explain how you detect each data issue and why you choose a specific cleaning method to deal with it.
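The cleaning rules above can be illustrated on a small made-up frame. This is a sketch under stated assumptions rather than a prescribed method: the frame, the "Notes" column, and the 0 to 190 range used for "Admission grade" are hypothetical (the upper bound is taken from the value listed for that attribute in Table 2, and the lower bound is assumed). Median imputation is only one of several possible ways to handle the remaining missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame containing one example of each issue.
df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4, 5],
    "Target": ["Graduate", " Dropout ", " Dropout ", "Enrolled", "Graduate", "Dropout"],
    "Admission grade": [120.0, 135.0, 135.0, 250.0, np.nan, 140.0],
    "Notes": [np.nan, np.nan, np.nan, "x", np.nan, np.nan],  # mostly empty column
})

# 1. Extra whitespace: strip the text column only.
df["Target"] = df["Target"].str.strip()

# 2. Duplicates: drop rows that are exact copies of an earlier row.
df = df.drop_duplicates().reset_index(drop=True)

# 3. Missing values: a column or row may be removed only if MORE than 50% is missing.
col_missing = df.isna().mean()
df = df.drop(columns=col_missing[col_missing > 0.5].index)
df = df.loc[df.isna().mean(axis=1) <= 0.5]

# 4. Impossible values: treat values outside the assumed valid range as missing.
low, high = 0, 190  # assumed range for "Admission grade"
grade = df["Admission grade"]
df.loc[grade.notna() & ~grade.between(low, high), "Admission grade"] = np.nan

# 5. Remaining missing values: impute instead of deleting (median is one option).
df["Admission grade"] = df["Admission grade"].fillna(df["Admission grade"].median())

print(df)
```

Each step touches only the cells that show the issue, which keeps the rest of the data unmodified as required.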

Task 2 – Data Exploration (25%)

At this point you should have finished Task 1 and obtained a single DataFrame containing the merged and cleaned data. You can now start to explore your data by carrying out the following steps:

2.1. Choose one categorical column and one numerical column. Visualise the data of each column in a way that is appropriate for its type. Note that you need to explore and identify potentially important columns, and be able to justify your choice. Don’t just make a random choice. Explore and then decide.

2.2. Choose three pairs of columns and explore the relationship within the column pairs using appropriate descriptive statistics and visualisation tools. Like Task 2.1 don’t just make a random choice. Explore the data and then decide. Your choice of column pairs should be done to address a “plausible hypothesis” on the data.

2.3. Choose six (6) numerical columns and build a scatter matrix. State why you selected the columns.
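Which descriptive statistic suits a column pair depends on the types of the two columns. The sketch below shows one common option for each combination, using made-up values. The column names are taken from Table 2, but the data and the choice of pairs are purely illustrative assumptions.

```python
import pandas as pd

# Hypothetical cleaned data (values invented for illustration).
df = pd.DataFrame({
    "Target": ["Graduate", "Dropout", "Graduate", "Enrolled", "Dropout", "Graduate"],
    "Scholarship holder": [1, 0, 1, 0, 0, 1],
    "Admission grade": [150.0, 110.0, 140.0, 125.0, 115.0, 145.0],
    "Curricular units 1st sem (grade)": [15.0, 9.0, 14.0, 12.0, 10.0, 14.5],
})

# Numerical vs numerical: Pearson correlation coefficient.
r = df["Admission grade"].corr(df["Curricular units 1st sem (grade)"])

# Categorical vs numerical: summary statistics of the number within each category.
grade_by_target = df.groupby("Target")["Admission grade"].describe()

# Categorical vs categorical: contingency table of counts.
counts = pd.crosstab(df["Scholarship holder"], df["Target"])

print(round(r, 3))
print(grade_by_target)
print(counts)
```

A statistic alone rarely settles a hypothesis, so each one would normally be paired with a plot of the same two columns.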

Note: Graphs (plots) must contain appropriate titles, axis labels, etc. so that they are self-explanatory. Graphs should be clear enough, in both size and information, for readers to read and understand.

Titles and labels on your graphs also help you to not be confused about what graph you are looking at! You will be creating many graphs so it is worth doing this properly from the start.

For your graphs, you will probably need to investigate what appropriate axes text labels to use, as the data set does not have the text description of the numerical columns.
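As an illustration of titled and labelled plots, the sketch below draws a bar chart for a categorical column, a histogram for a numerical column, and a small scatter matrix. The data is randomly generated and the label wording is an assumption. The scatter matrix uses two columns only to keep the example short, whereas the task asks for six. The "Agg" backend line is there so the sketch also runs as a plain script and is not needed inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Randomly generated stand-in data (not the assignment data).
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "Target": rng.choice(["Dropout", "Enrolled", "Graduate"], size=n),
    "Admission grade": rng.uniform(95, 190, size=n),
    "Age at enrolment": rng.integers(17, 60, size=n),
})

fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(11, 4))

# Categorical column: bar chart of category counts.
df["Target"].value_counts().plot(kind="bar", ax=ax_cat)
ax_cat.set_title("Number of students by outcome")
ax_cat.set_xlabel("Target")
ax_cat.set_ylabel("Number of students")

# Numerical column: histogram of the value distribution.
df["Admission grade"].plot(kind="hist", bins=10, ax=ax_num)
ax_num.set_title("Distribution of admission grades")
ax_num.set_xlabel("Admission grade")
ax_num.set_ylabel("Number of students")
fig.tight_layout()

# Scatter matrix: pairwise scatter plots with histograms on the diagonal.
axes = scatter_matrix(df[["Admission grade", "Age at enrolment"]],
                      figsize=(6, 6), diagonal="hist")
plt.suptitle("Scatter matrix of selected numerical columns")
```

For coded columns, replacing the numeric tick labels with the text meaning of each code would make the axes easier to read.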

Task 3 – Report (45%)

In this task, you are asked to write a report to elaborate on your analyses and findings from Tasks 1 and 2.

NOTE: In the report you will be explaining things.

Do NOT include Python code in this report as that is already in your notebook file.

When you are asked to explain how you did something, focus on the concepts or principles, not the code used. You can refer to the cell where the code is if you want, but it is usually not needed.

We DO want to see your clear communication with words, and supported by graphs!

Start by giving your report an appropriate title (the exact title is up to you), and include your name, student ID and student email address, and the date of the report. This is a good professional communication standard to have for all your reports. Make sure these details match exactly the details in your notebook file (assignment1.ipynb).

Also include the Unit Code and Title, and the Year and Semester. The exact layout and formatting of the report is up to you, but a simple template has been provided. You do NOT have to use the template.

You should then:

3.1. Create a sub-heading titled “Introduction”

• In one paragraph (with approximately 3-4 sentences) clearly state the purpose of this report. Tip: Explain what the data source is, why you have written the report (past tense, “To communicate the findings of Task 1 and Task 2 of the assignment”), and what key findings (if any) you found (as a one-sentence summary). This will set you up for the next two sub-sections.

3.2. Create a sub-heading titled “Task 1: Data Acquisition & Preparation” in your report under which you should:

• Briefly describe how you addressed this task.

• Describe how you merged the data from the three data files

• Describe each of the data issues you detected in data cleaning, explain how you detected it, and justify why you chose a specific data cleaning method to deal with it.

• Discuss any problems you encountered when undertaking this task and how you solved them.

3.3. Create a sub-heading named “Task 2: Data Exploration” in your report under which you need to:

• Create a sub-section with an appropriate title for each of the three sub-tasks in Task 2.

• In the sub-section for sub-task 2.1, for each selected column, include the graph(s) created for that column, and provide a brief explanation on why you chose that column and a specific visualisation method to explore it.

• In the sub-section for sub-task 2.2, briefly explain why you chose each of the three pairs of columns (e.g., stating the hypotheses that you intended to address), include the descriptive statistics and graph(s) for each of the three selected pairs, followed by a brief discussion on any interesting findings about the presence or lack of relationship between the two involved columns.

• In the sub-section for sub-task 2.3, include the plot of the scatter matrix, state why you selected the six columns (i.e. were you hoping to see a particular relationship?), and report your findings from the plot.

3.4. Create a sub-heading titled “Conclusions”

• In one paragraph (with approximately 1-2 sentences) restate the key outcomes of this report. Don’t say anything that hasn’t already been stated before. Make sure it matches with the Introduction and the purpose you described there.

Note: You must give each graph a figure number and a brief caption (e.g. “Figure 1. Relationship shown between …”) and you must refer to each figure in the text of your report. Don’t use words like “above” or “below” (e.g. don’t write “In Figure 2 below it shows …”; just use “In Figure 2 it shows …”). That way, your graphs will always make sense to the reader and it does not matter if they are moved around!

Tip: It is okay to add very clear sub-sub-headings to address each of the requirements above. Avoid large sentences and large paragraphs. Try to be direct and concise with your words. We do not mark based on how many words you write, but on the quality of the points you make. We do take marks away if there are too many unrelated words or if the words don’t add a valuable point that has been asked for.

The report must be saved in the PDF format and named “assignment1.pdf” for submission.

Your final report file MUST

• be named “assignment1.pdf”,

• be written in a single column format, with

• font size between 10 and 12 points, and

• have no more than 7 pages (including tables, graphs and/or references).

Penalties (mark deductions) will apply if the report does not satisfy these requirements. Moreover, the quality of the report will be considered when marking, e.g. organisation, clarity, and grammatical mistakes.

Please remember to cite any sources (as “References”) which you have referred to when doing your work! Sometimes a “footnote” will be appropriate. Remember that citing sources is a way to show what you know and understand, and should not be avoided. References and footnotes are evidence all good students will have in their reports.

Submission Requirements

The assignment is due at: 23:59, Friday 15 March 2024

Assignments submitted after this time are subject to late submission penalties. For detailed information, refer to the relevant section in the Unit Outline under the menu “Syllabus” in Canvas.

You need to prepare the following three files:

1. A notebook file named assignment1.ipynb which contains markdown headings, and all your code and code-level comments for Tasks 1 and 2.

2. An HTML version of the notebook file with output as assignment1.html.

3. A report file named assignment1.pdf which must strictly follow the format requirements detailed in Task 3.

Note: Please make sure to clean the code before making submission to remove all unnecessary code. Ensure you see all the data printed and all the graphs displayed as expected in your file.

To submit, you must upload these THREE files in Canvas under: “Assignments” > “Assignment 1”

1. assignment1.ipynb

2. assignment1.html

3. assignment1.pdf

Please do NOT submit any other unnecessary files. Marks will be deducted if you do.

Extensions will only be permitted in exceptional circumstances. You should always back up your code and other assignment-related documents frequently to avoid potential loss of progress. Note that accidental loss of progress, working while studying, and/or a heavy load of assignments will not be accepted as exceptional circumstances for an extension. For detailed information, please refer to the relevant section in the Unit Outline under the menu “Syllabus” in Canvas.

Assessment Criteria

Table 1 shows task number, summary details and points awarded when your work is assessed. A detailed rubric is available in the Canvas unit website under “Assignments” > “Assignment 1”. See there for complete details.

Note that the total for Task 1 is 30 points, Task 2 is 25 points, and Task 3 is 45 points, for a total of 100 points. Deductions will occur if General Requirements, as stated earlier, are not followed.

Table 1: Assessment Task, Summary Details and Points.

Task | Summary | Details | Points
1.1 | Loading the data | Loading the data from the three data files into three Pandas DataFrame entities and checking whether the loaded data are equivalent to the data contained in the raw data files | 5
1.2 | Merging | Merging the obtained three DataFrame entities into a single one that contains all records, where each record has a unique ID and all the listed attributes. |
1.3 | Cleaning | Cleaning the data by using the knowledge you have acquired. |
2.1 | Visualising categorical and numerical values | Choosing two columns with categorical and numerical values, respectively, and visualising each of them in an appropriate way. |
2.2 | Exploring relationships | Choosing three pairs of columns and exploring the relationship between the two columns involved in each pair via appropriate descriptive statistics and visualisation tools. |
2.3 | Scatter matrix | Choose six (6) numerical columns of interest and generate a scatter matrix. |
3.1 | Introduction | Purpose of report clearly stated and suitable preparation information for the next two sections of the report given. Content is consistent with the rest of the report. |
3.2 | Report Section – Task 1 | Report on “Task 1: Data Acquisition & Preparation” |
3.3 | Report Section – Task 2 | Report on “Task 2: Data Exploration” |
3.4 | Conclusion | Clear restating of the key outcomes presented in the report. |
Total Points | | |

Appendix: Data Attribute Details

Table 2: The set of 38 possible attributes for a student record with corresponding value ranges.

Feature | Type | Min | Max
 | Discrete | | 5000
Marital status | Discrete | |
Application mode | Discrete | |
Application order | Discrete | |
Course | Discrete | | 10000
Daytime/evening attendance | Discrete | |
Previous qualification | Discrete | |
Previous qualification (grade) | Continuous | | 190
Nationality | Discrete | | 109
Mother’s qualification | Discrete | |
Father’s qualification | Discrete | |
Mother’s occupation | Discrete | | 200
Father’s occupation | Discrete | | 200
Admission grade | Continuous | | 190
Displaced | Discrete | |
Educational special needs | Discrete | |
Debtor | Discrete | |
Tuition fees up to date | Discrete | |
Gender | Discrete | |
Scholarship holder | Discrete | |
Age at enrolment | Discrete | |
International | Discrete | |
Curricular units 1st sem (credited) | Discrete | |
Curricular units 1st sem (enrolled) | Discrete | |
Curricular units 1st sem (evaluations) | Discrete | |
Curricular units 1st sem (approved) | Discrete | |
Curricular units 1st sem (grade) | Continuous | | 20.00
Curricular units 1st sem (without evaluations) | Discrete | |
Curricular units 2nd sem (credited) | Discrete | |
Curricular units 2nd sem (enrolled) | Discrete | |
Curricular units 2nd sem (evaluations) | Discrete | |
Curricular units 2nd sem (approved) | Discrete | |
Curricular units 2nd sem (grade) | Continuous | | 20.00
Curricular units 2nd sem (without evaluations) | Discrete | |
Unemployment rate | Continuous | -100 | 100
Inflation rate | Continuous | -100 | 100
GDP | Continuous | -100 | 100
Target (Dropout, Enrolled, Graduate) | | |
