**Minitab SPSS R Statistical Analysis**

For Questions 1 and 2 of your first assignment, generate random samples of size 200-250 to perform the following tasks. You can assume that the population size for the travel agent survey is 2500 and for the regular illicit drug addicts’ survey is 2000.

Before generating your data, please read this assignment carefully and add any questions that are required in this assignment to your questionnaire (if your original questionnaire from assignment 1 does not cover them).

You can use Minitab, SPSS, or R for statistical analysis.

1- Based on your own questionnaire, use your sample data to summarise the responses for the travel agent business strategy and the government's drug reduction policies. These strategies and policies should have been addressed in your questions. Please pay particular attention to the following:

a- Use graphs and statistical analysis whenever possible.

b- Most of you have included questions in the travel agent questionnaire that ask for the age of the respondents and the amount they would spend on their trip. Use stratified sampling (at least two strata) for these two questions to provide an estimate of the average age of the travelers and the total amount spent on their travel. Use the sample size n (above) and define N1, N2, … yourself (in a logical manner that makes sense in your questionnaire) for each stratum, then use the proportional allocation rule to obtain n1, n2, …. Provide the error of both estimates.

c- Use the handout given to you for Statistics module 3 to assess whether the amount spent by the different age groups is the same or not. If it is not the same, then identify the groups that are different.

d- For the travel questionnaire, discuss in your report the age distribution, the amount of money to be spent, popular destinations, and so on. You can use a pie chart for this section. If the questionnaire that you have designed does not have questions about age, the amount spent, and destinations, please add these questions to your questionnaire before generating data.

e- For the regular illicit drug addicts, you can discuss their age distribution, family background, how they were first introduced to drugs, how much they spend on drugs, and whether they want to receive help to stop their addiction. Again, add these questions to your questionnaire before generating data.

Note: These are just a few points that you can address. For this assignment, I would rank your work in terms of quality and then allocate marks according to the quality of the submitted work.
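Task (b) above can be sketched in R as follows. This is only an illustration under stated assumptions: the three strata, their sizes `N_h`, and the simulated ages and spending amounts are invented for the example; only N = 2500 and n = 250 come from the assignment. The helper `stratified_estimate` is a hypothetical name, and the error bound uses the common "two standard errors" convention, which may differ from the one in your course notes.

```r
# Hypothetical strata for the travel survey (sizes are assumed; they sum to N = 2500)
N_h <- c(young = 1000, middle = 900, senior = 600)
N   <- sum(N_h)
n   <- 250

# Proportional allocation: n_h = n * N_h / N
n_h <- round(n * N_h / N)

# Simulated responses per stratum (made-up distributions, for illustration only)
set.seed(1234)
age <- list(young  = runif(n_h[["young"]], 18, 30),
            middle = runif(n_h[["middle"]], 30, 50),
            senior = runif(n_h[["senior"]], 50, 75))
spend <- list(young  = rnorm(n_h[["young"]], 600, 150),
              middle = rnorm(n_h[["middle"]], 1200, 300),
              senior = rnorm(n_h[["senior"]], 900, 250))

# Stratified estimators of the mean and total, with finite population correction
stratified_estimate <- function(y, N_h) {
  n_h  <- sapply(y, length)   # sample size per stratum
  ybar <- sapply(y, mean)     # sample mean per stratum
  s2   <- sapply(y, var)      # sample variance per stratum
  N    <- sum(N_h)
  W    <- N_h / N             # stratum weights
  mean_hat <- sum(W * ybar)
  var_mean <- sum(W^2 * (1 - n_h / N_h) * s2 / n_h)
  list(mean = mean_hat, se_mean = sqrt(var_mean),
       total = N * mean_hat, se_total = N * sqrt(var_mean))
}

age_est   <- stratified_estimate(age, N_h)
spend_est <- stratified_estimate(spend, N_h)

# Estimates with an approximate 95% bound on the error (2 * standard error)
c(mean_age = age_est$mean, bound = 2 * age_est$se_mean)
c(total_spend = spend_est$total, bound = 2 * spend_est$se_total)
```

With real survey data, the lists `age` and `spend` would be replaced by the observed values split by stratum.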

2- Write a report of at most 2 A4 sheets (font size 11) for each questionnaire. You do not need to include your graphs and statistical analysis in your report (put them in the appendix), but you should refer to them by graph or table number. Make sure that you briefly state your research question and target population at the beginning of your final report (there is no need to mention the sample and the data collection procedures; you have already been assessed on these points in assignment 1). Your report should be able to guide the travel agent or the government in setting up their business strategies or policies in a very informative way.

**Solution**

## Analysis of the drug problem

### Introduction

The purpose of the analysis is to identify the group of people who are ready to overcome drug addiction. The government can use the results of the analysis to combat drug addiction.

The target population is assumed to be 2,000 regular illicit drug users, as specified in the assignment. Respondents were grouped according to the age at which they first tried drugs. The data set contains 250 survey responses and 9 variables that describe the respondents’ sociological characteristics.

### Methods

For the categorical variables, I built bar charts showing the percentage distribution of each variable. Means and standard deviations were also used to summarise the data, and a decision tree model was used to explain it.
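The percentage distribution behind each bar chart can be computed directly with base R. The snippet below is a minimal sketch on a made-up vector of responses, not the survey data.

```r
# Toy responses (invented for illustration; not the survey data)
gender <- factor(c("Female", "Male", "Male", "Female", "Male"))

counts <- table(gender)                        # frequency of each category
pct    <- round(100 * prop.table(counts), 1)   # share of each category in percent
pct
```

The same two calls apply to any categorical column of the data set.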

To understand the structure of the data across different groups, an alluvial diagram was constructed.

### Results

The analysis identified the variables that most influenced a person’s predisposition to fight drug addiction:

- Employment Status
- Estimation of Health
- Age, when the drug was first tried

**Structure of data between Employment status and predisposition to combating drugs:**

The chi-squared test does not allow us to reject the null hypothesis that employment status and the predisposition to fight drug addiction are independent. Nevertheless, the data structure shows that students sought help more often than other groups.
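How such a test is read can be illustrated with a small contingency table. The counts below are invented for the example and are not the survey results; the 5% significance level is likewise an assumption.

```r
# Hypothetical counts (invented for illustration, not the survey results)
tab <- matrix(c(30, 20,
                25, 25,
                15, 35),
              nrow = 3, byrow = TRUE,
              dimnames = list(Employment_status = c("Student", "Full time", "Not employed"),
                              Tried = c("Yes", "No")))

res <- chisq.test(tab)
res$expected   # counts expected if the two variables were independent
res$p.value    # small values are evidence against independence

alpha <- 0.05  # assumed significance level
if (res$p.value < alpha) {
  message("Reject H0: the two variables appear to be associated")
} else {
  message("Fail to reject H0: no evidence of an association")
}
```

The null hypothesis of the test is independence, so a large p-value means only that the data give no evidence of a relationship.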

**Structure of data between Estimation of Health and predisposition to combating drugs:**

The chi-squared test does not allow us to reject the null hypothesis that the self-assessed health level and the predisposition to fight drug addiction are independent. Nevertheless, the data structure shows that people in good health are more likely to seek help.

**Structure of data between First_time_drugs and predisposition to combating drugs:**

The chi-squared test does not allow us to reject the null hypothesis that the age at which drugs were first tried and the predisposition to fight drug addiction are independent.

Nevertheless, the data structure shows that people aged 18 to 21 are more likely to seek help. At the same time, the age group from 21 to 25 prefers not to seek help.

The distribution of the data by gender is approximately even (image 1). The most common education level is “Undergraduate University Degree” (image 2). Most respondents rate their health at level 2 (image 5). Most respondents prefer to overcome loneliness through sport and video games, but about 16% of them prefer to overcome loneliness through drugs (image 6). Most respondents tried drugs for the first time between the ages of 20 and 25 (image 7).

The decision tree splits first on the variable Employment Status. People who work or study and rate their health above 5 points would most likely seek help with treatment for drug addiction. People who neither work nor study, as well as those under 18 years of age, would most likely not seek help with treatment for drug dependence.
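A classification tree of this kind can be sketched with `rpart` as follows. The data here are simulated with base R rather than with `wakefield`, a dependence between the variables is built in on purpose, and `Health_condition` is treated as numeric; all of these are assumptions made for the illustration and do not reproduce the survey data.

```r
library(rpart)

set.seed(1234)
n <- 250
toy <- data.frame(
  Employment_status = factor(sample(c("Student", "Full time employee", "Not employed"),
                                    n, replace = TRUE)),
  Health_condition  = sample(0:10, n, replace = TRUE)
)

# Built-in dependence (assumed): employed or studying people in good health
# are more likely to answer "Yes"
p_yes <- ifelse(toy$Employment_status != "Not employed" & toy$Health_condition > 5,
                0.8, 0.3)
toy$Tried <- factor(ifelse(runif(n) < p_yes, "Yes", "No"))

# Fit a classification tree and inspect its splits
fit <- rpart(Tried ~ Employment_status + Health_condition, data = toy, method = "class")
print(fit)

# In-sample accuracy; this is optimistic and is not a substitute for validation
pred <- predict(fit, toy, type = "class")
mean(pred == toy$Tried)
```

On purely random data a tree can still produce splits, so its structure is best read together with the chi-squared results.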

### Conclusion

Analysis of the data showed that we cannot reject the null hypothesis. Nevertheless, analysis of the data structure indicated that there are problems in the age group of students from 21 to 25 years. This age group most often does not seek help. Therefore, the government needs to run appropriate programs on the dangers of drugs in order to reduce the risk of drug use in this age group.

The analysis of drug dependence across different groups revealed problem groups in the population who need help. The decision tree shows that there is a problem with drug dependence treatment among adolescents (image 9). Therefore, the government needs to take steps to address this problem.

## Appendix

Image 1

Image 2

Image 3

Image 4

Image 5

Image 6

Image 7

Image 8

Image 9

## R code

library(wakefield)

library(ggplot2)

library(data.table)

library(dplyr)

##Question 1, drugs

set.seed(1234)

drugs <- r_data_frame(

n = 250,

Gender = r_sample_factor(c("Female", "Male")),

Education = r_sample_factor(c("Did not complete Year 12", "Completed Year 12 (VCE)", "TAFE Qualification", "Undergraduate University Degree", "Postgraduate University Degree")),

Employment_status = r_sample_factor(c("Student", "Part time employee", "Not employed", "Full time employee", "Casual employee", "Other")),

Relationship_status = r_sample_factor(c("Never married", "Divorced", "In a relationship")),

Health_condition = r_sample_factor(c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10")),

Overcome_Alone = r_sample_factor(c("Consume some alcohol", "Drugs", "Watching videos", "Playing a sport", "Meeting people", "Other")),

First_time_drugs = r_sample_factor(c("<18", "18-21", "21-25", "30-40", ">40")),

Neglected = r_sample_factor(c("Yes", "No", "Never used drugs")),

Tried = r_sample_factor(c("Yes", "No", "Never used drugs")) #three levels, so not a binary factor

)

str(drugs)

sum(is.na(drugs))

summary(drugs)

#Gender

drugs %>%

group_by(Gender) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Gender, y = proportion, fill = Gender)) + geom_bar(stat = "identity")

#Education

drugs %>%

group_by(Education) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Education, y = proportion, fill = Education)) + geom_bar(stat = "identity")

#Employment_status

drugs %>%

group_by(Employment_status) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Employment_status, y = proportion, fill = Employment_status)) + geom_bar(stat = "identity")

#Relationship_status

drugs %>%

group_by(Relationship_status) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Relationship_status, y = proportion, fill = Relationship_status)) + geom_bar(stat = "identity")

#Health_condition

drugs %>%

group_by(Health_condition) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Health_condition, y = proportion, fill = Health_condition)) + geom_bar(stat = "identity")

#Overcome_Alone

drugs %>%

group_by(Overcome_Alone) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Overcome_Alone, y = proportion, fill = Overcome_Alone)) + geom_bar(stat = "identity")

#First_time_drugs

drugs %>%

group_by(First_time_drugs) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = First_time_drugs, y = proportion, fill = First_time_drugs)) + geom_bar(stat = "identity")

#Neglected

drugs %>%

group_by(Neglected) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Neglected, y = proportion, fill = Neglected)) + geom_bar(stat = "identity")

#Tried

drugs %>%

group_by(Tried) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Tried, y = proportion, fill = Tried)) + geom_bar(stat = "identity")

library(rpart)

library(rpart.plot) #provides prp()

model.drugs <- rpart(Tried~., drugs)

prp(model.drugs, tweak = 1)

library(alluvial)

#Employment_status vs Tried

drugs.factors <- drugs %>%

group_by(Employment_status, Tried) %>%

count() %>%

ungroup()

alluvial(drugs.factors[,1:2], freq = drugs.factors$n) #axes are the two factors; n is the frequency

with(drugs, table(Employment_status, Tried))

with(drugs, chisq.test(Employment_status, Tried))

#Health_condition vs Tried

drugs.factors <- drugs %>%

group_by(Health_condition, Tried) %>%

count() %>%

ungroup()

alluvial(drugs.factors[,1:2], freq = drugs.factors$n) #axes are the two factors; n is the frequency

with(drugs, table(Health_condition, Tried))

with(drugs, chisq.test(Health_condition, Tried))

#First_time_drugs vs Tried

drugs.factors <- drugs %>%

group_by(First_time_drugs, Tried) %>%

count() %>%

ungroup()

alluvial(drugs.factors[,1:2], freq = drugs.factors$n) #axes are the two factors; n is the frequency

with(drugs, table(First_time_drugs, Tried))

with(drugs, chisq.test(First_time_drugs, Tried))


## Analysis of passenger traffic in airlines

### Introduction

The purpose of the study is to understand how much money passengers spend depending on various characteristics. By understanding the average costs for each group, the airline will be able to better optimize its business strategy.

The target population consists of 2,500 people, who were grouped according to age. The data set contains 250 survey responses and 17 variables that describe the clients’ preferences depending on their demographic and sociological characteristics.

### Methods

A graphical analysis was used to form the hypotheses and to look for dependencies in the data. Means and standard deviations were also used to describe how the data deviate from their mean values.

I conducted an exploratory analysis to look for linear and non-linear dependencies. It clarified the structure and the spread of the data, which is necessary in order to identify outliers and to assess confidence intervals.

I then built bar charts for the categorical variables, showing the percentage distribution of each variable.

To understand the structure of spending across different groups, an alluvial diagram was constructed.

I also built a decision tree to explain the categorical data and to understand the relationships between the variables.

### Results

To determine the importance of variables, a decision tree was used. It was found that the following variables strongly influence the cost predictions:

- Preferred travel destinations
- Employment status
- Insurance
- Seasons
- Age
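One way to obtain such a ranking from a fitted tree is the `variable.importance` component of an `rpart` object. The sketch below uses simulated data in which the budget depends on employment status only; the data, the reduced set of levels, and the probabilities are assumptions for the illustration and do not reproduce the survey.

```r
library(rpart)

set.seed(1234)
n <- 250
toy <- data.frame(
  employment_status = factor(sample(c("Student", "Full time employee", "Other"),
                                    n, replace = TRUE)),
  Season            = factor(sample(c("Summer", "Winter", "Rainy"), n, replace = TRUE))
)

# Assumed dependence: students mostly have the smaller budget
toy$Budget <- factor(ifelse(toy$employment_status == "Student",
                            sample(c("<200", "200-500"), n, replace = TRUE, prob = c(0.8, 0.2)),
                            sample(c("<200", "200-500"), n, replace = TRUE, prob = c(0.2, 0.8))))

fit <- rpart(Budget ~ ., data = toy, method = "class")

# Importance scores, rescaled to percentages of the total
imp <- fit$variable.importance
round(100 * imp / sum(imp), 1)
```

The scores also credit surrogate splits, so a variable can appear in the ranking without appearing in the plotted tree.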

**The structure of the analysis according to popular destinations depending on the budget:**

The analysis shows that with a budget of more than $1,000, people prefer Beaches and Hiking. With a budget of $200 to $500, people prefer Adventures, Waterfalls, and Historical destinations. With a budget of $500 to $1,000, people prefer Beaches, Outdoors, and Places. With a budget of less than $200, people prefer Waterfalls, Historical, Places, and Beaches.

To test the relationship between the two categorical variables, I used the chi-squared test. For Budget and Prefer_travel the test gives a p-value of 0.054. At the 5% significance level, we therefore cannot reject the null hypothesis that there is no relationship between the two variables.
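Because a p-value this close to 5% is sensitive to the large-sample approximation, one possible check is the Monte Carlo option of `chisq.test`. The sketch below runs on two independently simulated factors, not on the survey data, and the number of replicates `B = 5000` is an arbitrary choice.

```r
set.seed(1234)
n <- 250

# Two independent simulated factors (illustration only)
prefer <- factor(sample(c("Beaches", "Hiking", "Outdoors", "Adventures",
                          "Waterfalls", "Historical", "Places"), n, replace = TRUE))
budget <- factor(sample(c("<200", "200-500", "500-1000", ">1000"), n, replace = TRUE))

tab  <- table(prefer, budget)
asym <- chisq.test(tab)     # usual asymptotic test
min(asym$expected)          # rule of thumb: expected counts of at least 5

# The same test with a simulated (Monte Carlo) p-value
mc <- chisq.test(tab, simulate.p.value = TRUE, B = 5000)
c(asymptotic = asym$p.value, monte_carlo = mc$p.value)
```

If the two p-values fall on the same side of the chosen significance level, the conclusion does not depend on the approximation.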

**The structure of the analysis according to the employment status, depending on the budget, is as follows:**

People with a budget of more than $1,000 are often not students. It should be noted that people with part-time jobs do not travel with a budget of less than $200.

To test the relationship between the two categorical variables, I used the chi-squared test. For Budget and employment status the test gives a p-value of 0.88.

At the 5% significance level, we therefore cannot reject the null hypothesis that there is no relationship between the two variables.

The analysis revealed that the average age of passengers is between 25 and 40 years (image 1). Students travel more often than other groups (image 2). Passengers most often travel in the rainy season (image 4) and with friends (image 6). Most passengers prefer not to consume local food (image 9). People tend to rent a house nearby (image 11). People also prefer to travel to certain places (image 16).

It should also be noted that the data contain no outliers and no strong imbalances.

To identify non-linear relationships, a decision tree was constructed with Budget as the predicted variable. The tree splits first on the employment status “Student” and then on the variables Gender and Accommodation.

### Conclusion

The analysis of the travel data shows that there is no clearly defined linear dependence in the data. Nevertheless, the analysis of non-linear dependencies shows that there are patterns that can be used to predict how much money people will spend on their travel. It was found that students are the most active group, and that male students spend the greatest amount of money.

We were able to identify some of the dependencies, but at the same time, we cannot reject the null hypothesis.

The data analysis makes it possible to divide the population into groups that spend different amounts of money on travel depending on their characteristics. The travel agent can use the decision tree to take appropriate measures.

### Appendix

Image #1

Image #2

Image #3

Image #4

Image #5

Image #6

Image #7

Image #8

Image #9

Image #10

Image #11

Image #12

Image #13

Image #14

Image #15

Image #16

Image #17

Image #18

## R code

library(wakefield)

library(ggplot2)

library(data.table)

library(dplyr)

##Question 1, travel

set.seed(1234)

travel <- r_data_frame(

n = 250,

age = r_sample_factor(c("<18", "18-25", "25-30", "30-40", "> 40")),

employment_status = r_sample_factor(c("Student", "Part time employee", "Not employed", "Full time employee", "Casual employee", "Other")),

Gender = r_sample_factor(c("Female", "Male")),

Season = r_sample_factor(c("Summer", "Winter", "Spring", "Rainy", "Not Specific")),

Budget = r_sample_factor(c("<200", "200-500", "500-1000", ">1000")),

Whoom_travel = r_sample_factor(c("Friends", "Family", "Colleagues", "Alone")),

Insurance = r_sample_factor(c("Yes", "No", "Not specific")),

Airport_pick_up = r_sample_factor(c("Yes", "No")),

Food_local = r_sample_factor(c("Yes", "No", "Not specific")),

Accommodation = r_sample_factor(c("Hotel", "House", "Hostel", "Cottage", "Camping")),

Accommodation_closer = r_sample_factor(c("Yes", "No", "Not specific")), #three levels, so not a binary factor

Distance = r_sample_factor(c("<100", "100-200", "200-300", ">300", "None")),

Camp_sites = r_sample_factor(c("Yes", "No", "Not specific")),

Services = r_sample_factor(c("Room service", "Spa", "Massage", "Laundry", "Dinner")),

Tourist_guides = r_sample_factor(c("Yes", "No", "Not specific")),

Prefer_travel = r_sample_factor(c("Beaches", "Hiking", "Outdoors", "Adventures", "Waterfalls", "Historical", "Places")),

Get_information = r_sample_factor(c("Google", "Direct walks", "Friends", "Colleagues"))

)

str(travel)

sum(is.na(travel))

summary(travel)

#Exploratory travel

#Age

travel %>%

group_by(age) %>%

summarize(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = age, y = proportion, fill = age)) + geom_bar(stat = "identity")

#employment_status

travel %>%

group_by(employment_status) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = employment_status, y = proportion, fill = employment_status)) + geom_bar(stat = "identity")

#Gender

travel %>%

group_by(Gender) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Gender, y = proportion, fill = Gender)) + geom_bar(stat = "identity")

#Season

travel %>%

group_by(Season) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Season, y = proportion, fill = Season)) + geom_bar(stat = "identity")

#Budget

travel %>%

group_by(Budget) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Budget, y = proportion, fill = Budget)) + geom_bar(stat = "identity")

#Whoom_travel

travel %>%

group_by(Whoom_travel) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Whoom_travel, y = proportion, fill = Whoom_travel)) + geom_bar(stat = "identity")

#Insurance

travel %>%

group_by(Insurance) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Insurance, y = proportion, fill = Insurance)) + geom_bar(stat = "identity")

#Airport_pick_up

travel %>%

group_by(Airport_pick_up) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Airport_pick_up, y = proportion, fill = Airport_pick_up)) + geom_bar(stat = "identity")

#Food_local

travel %>%

group_by(Food_local) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Food_local, y = proportion, fill = Food_local)) + geom_bar(stat = "identity")

#Accommodation

travel %>%

group_by(Accommodation) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Accommodation, y = proportion, fill = Accommodation)) + geom_bar(stat = "identity")

#Accommodation_closer

travel %>%

group_by(Accommodation_closer) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Accommodation_closer, y = proportion, fill = Accommodation_closer)) + geom_bar(stat = "identity")

#Distance

travel %>%

group_by(Distance) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Distance, y = proportion, fill = Distance)) + geom_bar(stat = "identity")

#Camp_sites

travel %>%

group_by(Camp_sites) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Camp_sites, y = proportion, fill = Camp_sites)) + geom_bar(stat = "identity")

#Services

travel %>%

group_by(Services) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Services, y = proportion, fill = Services)) + geom_bar(stat = "identity")

#Tourist_guides

travel %>%

group_by(Tourist_guides) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Tourist_guides, y = proportion, fill = Tourist_guides)) + geom_bar(stat = "identity")

#Prefer_travel

travel %>%

group_by(Prefer_travel) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Prefer_travel, y = proportion, fill = Prefer_travel)) + geom_bar(stat = "identity")

#Get_information

travel %>%

group_by(Get_information) %>%

summarise(count = n()) %>%

mutate(proportion = count/sum(count)) %>%

ggplot(aes(x = Get_information, y = proportion, fill = Get_information)) + geom_bar(stat = "identity")

library(rpart)

library(rpart.plot)

model.travel <- rpart(Budget~., travel)

prp(model.travel, tweak = 1.4)

summary(model.travel)

library(alluvial)

#Prefer travel

travel.factors <- travel %>%

group_by(Prefer_travel, Budget) %>%

count() %>%

ungroup()

alluvial(travel.factors[,1:2], freq = travel.factors$n) #axes are the two factors; n is the frequency

with(travel, table(Prefer_travel, Budget))

with(travel, chisq.test(Prefer_travel, Budget))

#Employment status

travel.factors <- travel %>%

group_by(employment_status, Budget) %>%

count() %>%

ungroup()

alluvial(travel.factors[,1:2], freq = travel.factors$n) #axes are the two factors; n is the frequency

with(travel, table(employment_status, Budget))

with(travel, chisq.test(employment_status, Budget))