title: Hands on Data Analytics for Everyone
date: 2023-01-16
status: DONE
tags:
- DataAnalysis
- NOTE
author:
- AllenYGY
created: 2024-01-16
updated: 2024-06-14T17:24
publish: True
Hands on Data Analytics for Everyone
Data -> Data Preparation -> Model Training -> Model Optimization -> Model Testing
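E-mail, Wechat (Social-media)
csv format: new records are separated by a new line
The first record may be a header of field names: Year,Make,Model,Description,Price
Each record occupies one line, with commas as separators, e.g. (1997,Ford,E350,"ac, abs, moon",3000.00)
Spaces before and after a comma are ignored
A field that contains a comma, a newline, or a space must be enclosed in double quotes
A double quote inside a field is written as two double quotes
A field that contains a double quote must itself be enclosed in double quotes, e.g. aa,"bb,""cc"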
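The CSV quoting rules listed above can be checked with Python's standard `csv` module. This is a minimal sketch: the sample rows are the ones from these notes, and the variable names are only illustrative. One caveat: `csv.reader` keeps spaces around commas by default, and `skipinitialspace=True` only drops spaces that follow a delimiter.

```python
import csv
import io

# Sample text: a header row followed by the record quoted in the notes.
raw = (
    'Year,Make,Model,Description,Price\n'
    '1997,Ford,E350,"ac, abs, moon",3000.00\n'
)

rows = list(csv.reader(io.StringIO(raw)))
header, record = rows[0], rows[1]
print(header)
print(record)  # the quoted field keeps its embedded commas as one value

# A doubled double quote inside a quoted field decodes to a single quote.
quoted = next(csv.reader(['aa,"bb,""cc"']))
print(quoted)
```

Every value comes back as a string, so numeric columns such as Price still need an explicit conversion.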
If viewed as a pipeline, data analytics is the bridge that connects statistics and computer science.
It focuses on using statistical methods to discover insights from data. Statistics is more traditional and theoretical. Computer science focuses on solving all problems in a computable way, and includes topics such as computability, algorithms, system design, networks, artificial intelligence, and software engineering.
Statistical measures can be used to describe a dataset
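As a small illustration, a few common measures can be computed with Python's standard `statistics` module. The sample values are made up, and the choice of measures (mean, median, mode, range, population standard deviation) is an assumption, since the notes do not list specific ones.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample for illustration

summary = {
    "mean": statistics.mean(data),      # arithmetic average
    "median": statistics.median(data),  # middle value of the sorted data
    "mode": statistics.mode(data),      # most frequent value
    "range": max(data) - min(data),     # distance between the extremes
    "pstdev": statistics.pstdev(data),  # population standard deviation
}
print(summary)
```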
A bar chart shows discrete categories.
A histogram shows a continuous range split into bins.
A bar chart is suitable for categorical data, while a histogram is for numeric data.
Too few bins: the two peaks of the original distribution are no longer visible, and one gets the wrong impression that the distribution is unimodal.
Too many bins: this usually leads to a very scattered histogram in which it is difficult to distinguish true peaks from random peaks.
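The two remarks above read as being about the number of histogram bins. The sketch below illustrates that on a hand-made bimodal sample. The helpers `histogram` and `count_peaks` are hypothetical names written for this note, and the data are invented. Because the toy values are few and sit on a coarse grid, most of the fine bins stay empty, which mimics the scattered look.

```python
def histogram(values, bins, low, high):
    """Count values in `bins` equal-width intervals over [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - low) / width), bins - 1)  # v == high goes in the last bin
        counts[idx] += 1
    return counts


def count_peaks(counts):
    """Number of bins that are strictly higher than both neighbours."""
    padded = [-1] + counts + [-1]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Invented sample: how many values fall in each unit interval of [0, 12).
per_unit = [1, 1, 2, 4, 9, 4, 4, 9, 4, 2, 1, 1]
values = [k + 0.5 for k, n in enumerate(per_unit) for _ in range(n)]

for bins in (3, 12, 48):
    counts = histogram(values, bins, 0.0, 12.0)
    print(bins, "bins ->", count_peaks(counts), "peak(s)")
```

With 3 bins both modes land in the middle bin, with 12 bins the two modes are visible, and with 48 bins every occupied bin looks like its own peak.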
Boxplot [2]
Boxplots are a very compact way to visualize and summarize the main characteristics of a numeric attribute, through the median, the IQR, and possible outliers.
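The quantities a boxplot draws can be computed directly. This is a rough sketch with made-up data. The 1.5 × IQR fence is a common convention assumed here, not something these notes specify, and the `inclusive` quantile method is likewise one choice among several.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # made-up sample with one extreme value

# Quartiles: the box spans q1..q3, with the median drawn inside it.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr, outliers)
```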
Requires min-max-normalization of numeric columns
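Min-max normalization rescales a column linearly so that its smallest value maps to 0 and its largest to 1, i.e. x' = (x - min) / (max - min). A minimal sketch follows. The function name, the sample column, and the handling of a constant column are assumptions made for illustration.

```python
def min_max_normalize(column):
    """Rescale a numeric column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column has no spread, so map it to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


prices = [3000.0, 4500.0, 6000.0]  # hypothetical numeric column
normalized = min_max_normalize(prices)
print(normalized)
```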
Concatenation: (columns do not change)
Join: (columns change)
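The difference can be sketched with plain Python lists of dictionaries standing in for tables. The records, the key column `id`, and the choice of an inner join are all hypothetical and chosen only for illustration.

```python
# Two tables with the same columns (hypothetical records).
cars_a = [{"id": 1, "make": "Ford"}, {"id": 2, "make": "Audi"}]
cars_b = [{"id": 3, "make": "Fiat"}]

# Concatenation: rows are stacked, the set of columns stays the same.
concatenated = cars_a + cars_b

# A second table that shares the key column "id" but adds a new column.
prices = [{"id": 1, "price": 3000.0}, {"id": 3, "price": 1200.0}]
price_by_id = {row["id"]: row for row in prices}

# Inner join on "id": matching rows are merged, so the set of columns grows.
joined = [
    {**row, **price_by_id[row["id"]]}
    for row in concatenated
    if row["id"] in price_by_id
]

print(concatenated)
print(joined)
```

Here the concatenated table still has the columns `id` and `make`, while the joined table has `id`, `make`, and `price`.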
The learner is provided with a set of data inputs together with the corresponding desired outputs
Training examples as input patterns, with no associated output
The target variable that we’re trying to predict is continuous, e.g. (living areas and prices)
The target variable can take on only a small number of discrete values, e.g. (insurance)
Given a training set, the goal is to learn a function (hypothesis/model) f: X ⟼ Y such that f(x) is a “good” predictor for the corresponding value of y.
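For a continuous target, one simple choice of f is a straight line fitted by ordinary least squares. That choice is an assumption made here for illustration, since the notes do not name a particular model. The living areas and prices below are invented, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented training set: living area (x) and price (y).
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

slope, intercept = fit_line(areas, prices)


def f(x):
    """The learned hypothesis: predicts a price from a living area."""
    return slope * x + intercept


print(slope, intercept, f(90.0))
```

The toy data lie exactly on a line, so the fit recovers it. Real data would leave a residual error that measures how "good" the predictor is.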
The percentage of test set tuples that are correctly classified by the classifier
| Class | C1 (predicted) | C2 (predicted) | Total | Accuracy |
|---|---|---|---|---|
| C1 (actual) | true positives (TP) | false negatives (FN) | positives (P) | TP/P |
| C2 (actual) | false positives (FP) | true negatives (TN) | negatives (N) | TN/N |
| Total | predicted positives (Pp) | predicted negatives (Pn) | All | (TP+TN)/All |
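The entries of the table can be computed from a list of actual and predicted labels. A minimal sketch, assuming C1 is treated as the positive class. The label lists are invented and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FN, FP, TN) with `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Invented test set labels.
actual = ["C1", "C1", "C1", "C2", "C2", "C2", "C2", "C2"]
predicted = ["C1", "C2", "C1", "C2", "C1", "C2", "C2", "C1"]

tp, fn, fp, tn = confusion_counts(actual, predicted, positive="C1")
positives, negatives = tp + fn, fp + tn

accuracy = (tp + tn) / (positives + negatives)  # (TP+TN)/All
rate_c1 = tp / positives                        # TP/P
rate_c2 = tn / negatives                        # TN/N
print(tp, fn, fp, tn, accuracy, rate_c1, rate_c2)
```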
Discover hidden structures in unlabeled data
Clustering identifies a finite set of groups (clusters)
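One widely used way to find such groups is k-means, shown here as a rough one-dimensional sketch. K-means is an assumption made for illustration, since the notes do not name an algorithm. The points, the starting centroids, and the fixed iteration count are all invented.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]  # unlabeled, invented data
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are used at any step: the two groups emerge only from the distances between the points.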