date: 2024-12-23
title: BDA-Stream Data
status: DONE
author:
- AllenYGY
tags:
- NOTE
- BDA
publish: True
BDA-Stream Data
Infinite Data
In many data analytics situations, we do not know the entire data set in advance
Stream Management is important when the input rate is controlled externally:
We can think of the data as infinite and non-stationary (the distribution changes over time)
Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
The system cannot store the entire stream accessibly
Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory?
Types of queries one wants to answer on a data stream (we'll do these today):
Sampling data from a stream
Queries over sliding windows
Filtering a data stream
Counting distinct elements
Estimating moments
Finding frequent elements
Solution: Sample Users
Pick
Use a hash function that hashes the user name or user id uniformly into 10 buckets
General Sampling Procedure
To sample a fraction $a/b$ of the data stream, proceed as follows:
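One way to realize this is to extend the 10-bucket idea above: hash each key into $b$ buckets and keep a tuple only when its key lands in one of the first $a$ buckets. The sketch below is a minimal illustration, not the note's own code. The names `bucket_of`, `keep_in_sample`, and `sample_by_key`, the `(user, query)` tuple layout, and the choice of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib


def bucket_of(key: str, num_buckets: int) -> int:
    """Map a key to a bucket in [0, num_buckets) using a stable hash.

    SHA-256 is an assumed choice; any hash that spreads keys roughly
    uniformly and is stable across runs would serve the same purpose.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def keep_in_sample(key: str, a: int, b: int) -> bool:
    """Return True if the key falls into one of the first a of b buckets."""
    return bucket_of(key, b) < a


def sample_by_key(stream, a: int, b: int):
    """Yield the (user, query) pairs whose user hashes into the sample.

    Because the decision depends only on the user, a sampled user
    contributes all of their queries and an unsampled user none.
    """
    for user, query in stream:
        if keep_in_sample(user, a, b):
            yield user, query
```

Calling `sample_by_key(stream, 1, 10)` corresponds to the 1-in-10 user sample described earlier. Since the decision is a pure function of the key, no per-user state has to be stored to stay consistent across the stream.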
When memory is limited, maintain a random sample $S$ of fixed size $s$ from the data stream.
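A standard technique for keeping such a fixed-size sample is reservoir sampling. The note does not name it at this point, so treat the sketch below as one possible realization under that assumption. The function name `reservoir_sample` and the optional `rng` parameter are hypothetical and introduced only for illustration.

```python
import random


def reservoir_sample(stream, s: int, rng=None):
    """Keep a uniform random sample of at most s items from a stream.

    The first s items fill the reservoir. The n-th item (n > s) is kept
    with probability s / n, and if kept it replaces a uniformly chosen
    item already in the reservoir.
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)
        elif rng.random() < s / n:
            sample[rng.randrange(s)] = item
    return sample
```

Only the reservoir and the counter `n` are held in memory, so the cost is independent of the stream length. After $n$ items have been seen, each of them is in $S$ with probability $s/n$.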