AWS Certification
A carefully crafted and important piece. * This will only be useful after completing the entire course on Udemy.
Udemy Course for AWS ML Specialty
Cheat Sheet
Reduce the cost of Automatic Hyperparameter tuning on SageMaker
- use log scales on parameter ranges
- run fewer training jobs concurrently while tuning, because the tuner learns from the results of earlier runs
- use the smallest possible range of hyperparameters
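To see why a log scale helps on a range that spans several orders of magnitude, here is a minimal sketch in plain Python. It does not call the SageMaker tuner; the learning-rate range and the helper names are assumptions made up for illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def share_below(samples, threshold):
    """Fraction of samples that fall below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
low, high = 1e-5, 1e-1   # hypothetical learning-rate range

linear_samples = [rng.uniform(low, high) for _ in range(1000)]
log_samples = [sample_log_uniform(low, high, rng) for _ in range(1000)]

# Linear sampling almost never tries small values; log sampling covers each decade evenly.
print("linear scale, share below 1e-3:", share_below(linear_samples, 1e-3))
print("log scale, share below 1e-3:", share_below(log_samples, 1e-3))
```

With a linear scale, roughly 1% of the trials land below 1e-3, so half of the decades in the range are barely explored and the tuning budget is spent less evenly.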
[...] is an important metric in situations where classes are highly imbalanced and the positive case is rare. Accuracy tends to be misleading in these cases.
- Ex: Fraud Detection
Cheat Sheet for Confusion Matrix
More epochs and overfitting?
- use dropout regularization
- early stopping of epochs is good advice
SageMaker notebook instances are Internet-enabled, creating a potential security hole in your VPC.
- VPC Interface Endpoint (PrivateLink)
- Modify the instance's security group to allow outbound connections for training and hosting.
Edge
- SageMaker Neo and IoT Greengrass
- sample edge device: Nvidia Jetson
To design and push something to the edge
- design something to do the job, say a TF model
- compile it for the edge device, say Nvidia Jetson, using SageMaker Neo
- run it on the edge using IoT Greengrass
NLP on Amazon: Comprehend
- Another solution would be to use natural language processing through a service such as Amazon Comprehend.
You are training an XGBoost model on SageMaker with millions of rows of training data, and you wish to use Apache Spark to pre-process this data at scale. What is the simplest architecture that achieves this?
The [...] classes allow tight integration between [...] and [...] for several models including [...], and offer the simplest solution.
- Categorical: Deep Learning
- Numerical: kNN
Using Spot Instances in response to anticipated surges in usage is the most cost-effective approach for scaling up an EMR cluster.
- What you don't want to lose becomes your loss function when building your model
- Example: for fraud detection, you don't want false negatives, so FN / (FN + TP) is the loss function
- PCA
- K-Means Clustering
This should have an environment variable SAGEMAKER_PROGRAM with value train.py in the Dockerfile.
- Transcribe (speech to text) → Lex (chatbot engine that works on intent) → Polly (reads the given text aloud, i.e. text to speech)
- in a real implementation we also use DynamoDB and Lambda functions
- Factorization Machines → Sparse Data
- Sparse Data → Factorization Machines
- The choice with the lightest color along the diagonal axis is the correct one, as it represents the lowest number of correct predictions.
- either both good or both bad
- Synthetic Minority Oversampling Technique
- use L2 instead of L1
- or we can also just reduce the L1 regularization term (this term controls how strongly L1 is applied)
- dropout regularization technique
- early stopping of epochs
- using fewer layers may help
- splits data into a fixed number of buckets, with the same number of observations in each bin.
- Quantile binning
- some intervals could have fewer items and some could have far more → this behavior loses visibility of the distribution
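The difference between the two binning styles above can be sketched in a few lines of plain Python. The data set and function names are hypothetical, and the quantile version assigns bins by rank, which is a simplification (tied values may land in different bins).

```python
def interval_bins(values, n_bins):
    """Equal-width binning: every bin spans the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def quantile_bins(values, n_bins):
    """Equal-frequency binning: every bin gets (almost) the same number of observations."""
    n = len(values)
    counts = [0] * n_bins
    for rank in range(n):  # rank of each value after sorting
        counts[rank * n_bins // n] += 1
    return counts


# Hypothetical skewed data: many small values and a few very large ones.
data = [1, 1, 2, 2, 3, 3, 4, 5, 50, 100, 500, 1000]

print("interval:", interval_bins(data, 4))
print("quantile:", quantile_bins(data, 4))
```

On this skewed sample the equal-width bins put almost everything into the first bucket, while the quantile bins stay balanced.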
- Horovod
- Parameter Servers
- Training with unshuffled data may cause training to fail.
- P: Precision
- R: Recall
- S3 → Glue Crawlers → Glue Data Catalog → Athena → QuickSight
- approach 1:
- - use PySpark + XGBoostSageMakerEstimator to prepare data using Spark
- - then pass the data to SageMaker
- approach 2: without using XGBoostSageMakerEstimator
- - use Spark on EMR to pre-process the data and store it back in the same or another S3 bucket
- - keep the S3 bucket accessible to SageMaker to train on
- always supervised for → discrete data
- Deep Learning for → categorical data
- mean or median comes next
- dropping the rows comes after that
- order → Step Functions
- put data in S3
- use QuickSight's native ML Insights feature
- also use a QuickSight dashboard for visualization
- subsample
- alpha
- eta
- gamma
- lambda
- custom CNN for computer vision or image detection
- camera at the location
- DeepLens
- DeepLens_kinesis_Video module
- SageMaker
Clone the existing one and start building on top of it.
- can be below or above
- Transfer learning generally involves using an existing model or adding additional layers on top of one.
- handle sparse data
RecordIO/protobuf in float32 format
- RecordIO is efficient
It can access S3 buckets with "sagemaker" in the name.
Each line of the input file contains one training sentence along with its labels. Labels must be prefixed with the [...], and the tokens within the sentence, including punctuation, should be space-separated.
- if using Pipe mode, we don't copy the data to the training machine
- we stream the data
- it makes a big difference for big datasets
- requirements of Pipe mode? → RecordIO format
- plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.
- orderly executed → Step Functions
- just scheduling ability, with no ordering required → AWS Batch
- Too Large → overshoots true minima
- Too Small → slows down convergence, takes more time
- Too Large → stuck at local minima
- Smaller size → true minima
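The two learning-rate cases above (too large overshoots, too small converges slowly) can be illustrated with a toy gradient descent. This is only a sketch on the assumed function f(x) = x**2, not a real training loop; the step count and rates are arbitrary.

```python
def gradient_descent(lr, steps=50, x=1.0):
    """Minimize f(x) = x**2, whose gradient is 2*x, and return the final x."""
    for _ in range(steps):
        x -= lr * 2 * x
    return x


# The true minimum is at x = 0, so a smaller final |x| is better.
for lr in (1.1, 0.1, 0.001):
    print(f"lr={lr}: final |x| = {abs(gradient_descent(lr)):.6g}")
```

With lr=1.1 every step jumps past the minimum and lands farther away than before, with lr=0.001 the iterate has barely moved after 50 steps, and lr=0.1 gets close to zero.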
When training, we usually want the model to be less bad on one quality.
- that one quality → Loss Function
- the actual least bad → actual minimal bad → actual minima → true minima
- we need to provide vocabulary files
[...] our words into integers
- RecordIO-protobuf format with integer tokens
- You speak in language 1 → AWS Transcribe → AWS Translate → AWS Polly speaks in language 2
- this is for sentiment analysis
- because it is only sentiment analysis → order of words doesn't matter
- Uses Skip-gram and CBOW (Continuous Bag Of Words)
- BlazingText doesn't use LSTM or CNN
An already-trained model under the hood of AWS.
- Kinesis Data Analytics has a native Random Cut Forest algorithm ❤️, use that.
- Random Cut Forest is Amazon's own algorithm for anomaly detection and is usually the right choice when anomaly detection is asked for on the exam. It is implemented within both Kinesis Data Analytics and SageMaker, but only Kinesis works in the way described.
- feeds back into the same neuron (hence the name recurrent: reoccurring)
- how long the fed-back signal persists → LSTM: long or short
- it is a time-series problem
- use RNN
- JSON data input as Kinesis Streams
- - send to Firehose
- Supply to Kinesis Firehose
- - convert to Parquet or ORC and load to S3
- Athena queries S3 using Glue Crawler and Glue Data Catalog and provides analytics
- use ReLU
- from multiplying together many small derivatives of the sigmoid activation function in multiple layers
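The multiplication effect described above can be checked numerically. The sketch below assumes the best case for sigmoid, where every layer contributes its maximum derivative of 0.25, and ignores the weights entirely, so it is an upper-bound illustration rather than a real backpropagation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# The sigmoid derivative peaks at z = 0, where it equals 0.25.
max_derivative = sigmoid_derivative(0.0)

# Chain rule: the gradient reaching early layers is a product of per-layer derivatives.
for n_layers in (1, 5, 10, 20):
    print(n_layers, "layers ->", max_derivative ** n_layers)
```

The ReLU derivative is 1 for positive inputs, so the same product does not shrink along active paths, which is why the ReLU bullet is the usual remedy.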
- both are algorithms
- Object2Vec creates embeddings for arbitrary objects, like Tweets
- BlazingText can only find relationships between words but not entire tweets
- M4
- XGBoost is a CPU-only algorithm
- no benefit from GPUs
- GPU Type → P3 or P2
- GPU: Accelerated Computing
- P, G
- CPU: Standard
- M, T
- Memory-Optimized: current generation
- R
- Compute-Optimized: current generation
- C
- Inference Accelerator
- another level
- kNN
- SVM + RBF
- SVM: Support Vector Machine
- RBF: Radial Basis Function
- discard them by identifying them as lying more than some multiple of a standard deviation away from the mean
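The standard-deviation rule above might look like the following in plain Python. The threshold of 2 standard deviations and the sample data are assumptions chosen for illustration; the right multiple depends on the data.

```python
import statistics


def drop_outliers(values, n_std=2.0):
    """Keep only values within n_std population standard deviations of the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) <= n_std * std]


# Hypothetical readings with one obvious outlier.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 200]

cleaned = drop_outliers(data, n_std=2.0)
print(cleaned)
```

Note that on a sample this small a single extreme value inflates the standard deviation so much that a 3-sigma threshold would keep it, which is why the sketch uses 2.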
- Glue ETL: FindMatchesML ❤️ feature
- LDA: Latent Dirichlet Allocation, unsupervised topic modeling
- NTM: Neural Topic Model, a SageMaker algorithm
- SageMaker LDA Algorithm
- SageMaker NTM Algorithm
- Amazon Comprehend also (this does sentiment and full)
- If no outliers? → Mean
- If there are outliers? → Median
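A small sketch of why the median is the safer fill value when outliers are present. The `impute` helper and the income figures are hypothetical; missing entries are represented as `None`.

```python
import statistics


def impute(values, strategy):
    """Replace None entries with the mean or the median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical incomes in thousands; 1000 is an outlier and one value is missing.
incomes = [30, 32, 35, None, 31, 1000]

print("mean fill:  ", impute(incomes, "mean")[3])
print("median fill:", impute(incomes, "median")[3])
```

The mean is dragged far above every typical value by the single outlier, while the median stays in the middle of the ordinary observations.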
- Yes
- [...] are made for this purpose, like Tesla Shadow Mode
- AUC: Area Under Curve
- ROC: Receiver Operating Characteristic
- A good ROC curve bends up toward (0, 1)
- Perfect AUC is 1.0
- put tags on SageMaker resources
Use conditions in IAM policies to select these tags on SageMaker instances.
- Inter-container encryption is just a checkbox away when creating a training job via the SageMaker console.
- It can also be specified using the SageMaker API with a little extra work
Your inference container responds on port 8080, and must respond to ping requests in under 2 seconds. Model artifacts need to be compressed in tar format, not zip.
- to optimize?
WSS is one way, also called an [...] method
You can't deploy SageMaker to an EMR cluster
XGBoost actually requires LibSVM or CSV input
Imputation: best ML filling choices?
ML with sporadic spikes of traffic, if any?
Semantic Segmentation
What is a Loss Function?
Reduce the Dimensionality
KNN: Supervised; K-Means: Unsupervised
/opt/ml/code/train.py
Organizing data by date using S3 prefixes allows Glue to partition the data by date, which leads to faster queries on date ranges.
S3 lifecycle policies can automate the process of archiving old data to Glacier.
Make your own Alexa
[...] won't know about your company logo, nor will Object Detection until you have trained it first.
While Ground Truth can use the Mechanical Turk workforce as an option, it is purpose-built for this sort of task and can be set up very quickly.
Factorization machines are relevant to handling sparse data, but they don't perform dimensionality reduction per se.
PCA is a powerful dimensionality reduction technique that will find the best dimensions.
Given a multi-axis confusion matrix as a heat map with a diagonal axis
We can never say a graph captures the trend well but the seasonality badly.
Seasonality refers to periodic changes, while trends are longer-term changes over time.
Kinesis Analytics can do minimal transformations natively using SQL.
Amazon Forecast: RTF service on AWS for forecasting.
You are on EMR; to use S3 → always EMRFS
SMOTE: an ingenious oversampling technique
Large Batch Size → stuck in local minima → you will miss true minima
L1 Regularization Technique → reduces features (very useful to fix overfitting) → if applied too aggressively, it might also underfit too soon.
L2 Regularization Technique → it weights each feature instead of removing them entirely, which can lead to better accuracy
Tackle underfitting?
Tackle overfitting?
Quantile Binning
unevenly distributed data and preserve the distribution
What if interval binning is used?
SageMaker Distributed Training
can't be done out of the box
Did training fail?
Training data should always be normalized and shuffled.
SageMaker Linear Learner supports both [...] and [...] tasks.
F1 Score → 2·P·R / (P + R)
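As a worked example of the F1 formula, here is a short sketch that derives precision, recall, and F1 from confusion-matrix counts. The counts are made up for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall: 2*P*R / (P + R)."""
    return 2 * p * r / (p + r)


# Hypothetical fraud-detection counts.
tp, fp, fn = 8, 2, 12

p = precision(tp, fp)   # 8 / 10
r = recall(tp, fn)      # 8 / 20
f1 = f1_score(p, r)
false_negative_rate = fn / (fn + tp)  # equals 1 - recall

print(f"P={p:.2f} R={r:.2f} F1={f1:.4f} FNR={false_negative_rate:.2f}")
```

Here precision looks fine at 0.80, but recall of 0.40 shows that most fraud cases are missed, and F1 sits between the two, closer to the weaker one.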
Glue and Glue ETL can impart structure to unstructured data, and perform transformations on that data as it is received.
Athena is a serverless solution that can query S3 data lakes directly when paired with Glue.
Data in S3 and need visualizations?
When you want to prepare so much data → you always want it done in parallel, and [...] is the only one good at it.
So much data on S3 and want to use it for ML?
Neither Glue ETL nor Kinesis Analytics can convert to [...] format
[...] is not for a distributed solution.
LibSVM: A Library for Support Vector Machines
What is the best imputation technique?
Training involves multiple long-running ETL jobs which need to execute in order
QuickSight's ML Insights feature allows forecasting using QuickSight itself. This is a serverless solution that contains the least number of components.
Forecasting without any overhead at all?
XGBoost hyperparameters
Recall (TP / (TP + FN)) is important when the cost of a false negative is higher than that of a false positive.
Detect a custom logo or t-shirt from a place with cameras?
Quickly build another classifier beside the current one?
Transfer learning
Factorization Machines → float32
Factorization Machines
For SageMaker Pipe Mode
SageMaker Notebook if created with default IAM
Unless you add a policy with S3FullAccess permission to the role, it is restricted to buckets with "sagemaker" in the bucket name. Strange but true.
BlazingText format
Why Pipe mode?
SageMaker LDA → only Pipe mode → so RecordIO
SageMaker LDA → training on a single instance only
SageMaker Factorization Machines → RecordIO && float32
AWS Batch
Complex workflow?
Learning Rate
Batch Size
What is this true minima?
SageMaker Seq2Seq
BlazingText
Categorical features need to be converted into one-hot, binary representations prior to use in a neural network.
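A minimal sketch of one-hot encoding in plain Python, to make the conversion concrete. The `one_hot` helper and the color values are hypothetical; in practice a library encoder would normally be used.

```python
def one_hot(values):
    """Map each categorical value to a binary vector with a single 1."""
    categories = sorted(set(values))               # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature
categories, encoded = one_hot(colors)

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Each category becomes its own column, so the network never sees an artificial ordering such as blue < green < red.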
Celebrity Detection
To detect some anomaly on a stream?
LSTM: a specific kind of RNN, Long Short-Term Memory
RNN
Generate music?
Kinesis Firehose has the ability to convert JSON data to Parquet or ORC format on the fly.
Athena performs much more efficiently and at lower cost when using columnar formats such as Parquet or ORC.
Serverless analytics?
AWS Rekognition can identify common objects in images right out of the box.
Comprehend could be used to produce topics for the text in the posts.
Comprehend: RTF AWS NLP
BlazingText: just an algorithm for NLP on SageMaker
Vanishing gradient?
Reasons for vanishing gradient?
SageMaker Object2Vec vs. SageMaker BlazingText
XGBoost instance type?
Instance Types: https://aws.amazon.com/sagemaker/pricing/instance-types/
Non-linear clustering solutions
Outliers can skew linear models.
Spot Instances → task nodes on EMR
Deduplication?
Assign topics for a bunch of texts
Find topics
Imputation?
Can SageMaker's new model be tested without impact to customers?
Curves
Shuffling is recommended with SageMaker Linear Learner
How to control access to SageMaker notebooks for specific IAM groups?
Full encryption while training due to PII data in the dataset?
Custom inference container requirements?
K-Means is unsupervised. (from memo: KUM)
End.
Translated from: https://medium.com/swlh/cheat-sheet-for-aws-ml-specialty-certification-e8f9c88566ba