[Python] A roundup of common packages for data analysis, modeling, and AI

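- I. Reading files of various formats with Python
  - 1. PDF files: pdfplumber
  - 2. Word files: docx
  - 3. Excel files: xlrd
  - 4. Image files: PIL
  - 5. Extracting all images from a PDF: fitz
- II. AI applications
  - 1. OCR text recognition in images: easyocr
  - 2. OCR text recognition in images: paddleocr
  - 3. Face recognition: face_recognition
  - 4. Text segmentation: jieba
  - 5. A Chinese NLP toolkit for sentiment analysis, syntactic parsing, POS tagging, and information extraction: paddlenlp
  - 6. Automatic parsing of Chinese addresses: cpca
- III. Performance optimization for data processing
  - 1. Faster pandas reads and writes: feather
  - 2. Binary serialization: pickle
  - 3. Speeding up numpy: numexpr
  - 4. Speeding up numpy: cupy
  - 5. Optimizing Python numeric functions: numba
  - 6. Speeding up pandas: swifter
  - 7. Speeding up pandas: modin
- IV. Automated exploratory data analysis tools
  - 1. d-tale
  - 2. pandas profiling
  - 3. sweetviz
  - 4. autoviz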

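I. Reading files of various formats with Python

1. PDF files: pdfplumber

import pdfplumber

# Open the PDF with pdfplumber.open (pass password only if the file is encrypted)
with pdfplumber.open("example.pdf", password='paswrd') as pdf:
    # Get the first page
    first_page = pdf.pages[0]
    # Print the first character object on the first page
    print(first_page.chars[0])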

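2. Word files: docx (pip package python-docx)

from docx import Document

# Read the Word document
document = Document('sample.docx')
# Get all paragraphs
all_paragraphs = document.paragraphs
# Loop over the paragraphs
for paragraph in all_paragraphs:
    # Print the text of each paragraph
    print(paragraph.text)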

3. Excel files: xlrd

import xlrd

# Open the workbook (xlrd 2.0 and later reads only .xls files, not .xlsx)
workbook = xlrd.open_workbook(filename='sample.xls')
# Get a sheet by index
table = workbook.sheets()[0]
# Get a sheet by name
table = workbook.sheet_by_name(sheet_name='Sheet2')
# Get the contents of a given row
table_list = table.row_values(rowx=0, start_colx=0, end_colx=None)
# Get the contents of a given column
table_list = table.col_values(colx=0, start_rowx=0, end_rowx=None)

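The matching package for writing Excel files is xlwt:

import xlwt

book = xlwt.Workbook(encoding='utf-8')
sheet1 = book.add_sheet("custom_sheet_name")
# write(row, column, value): put a value in the first cell
sheet1.write(0, 0, "value")
# xlwt writes the legacy .xls format
book.save('output.xls')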

You can also read a sheet straight into a DataFrame with pandas:

import pandas as pd
df = pd.read_excel('filename.xlsx',sheet_name='Sheet2')

4. Image files: PIL

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

# Load the image
picture = Image.open('example.png')
# Convert the image to a numpy array of pixel values and display it
picture_data = np.array(picture)
plt.imshow(picture_data.astype('uint8'))
plt.show()

5. Extracting all images from a PDF: fitz

fitz is the import name of PyMuPDF, so install pymupdf first:

pip install pymupdf

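Use fitz to read every object in the PDF, pick out the image objects with a regular expression, convert each one to a pixmap, and write it out as an image file:

import fitz
import re
import os

file_path = r'C:\xxx\xxx.pdf'  # path to the PDF file
dir_path = r'C:\xxx'  # folder where the images are saved

def pdf2image1(path, pic_path):
    # Matches "/Subtype" only when it is followed by "/Image"
    checkIM = r"/Subtype(?= */Image)"
    pdf = fitz.open(path)
    # Number of entries in the cross-reference table
    # (older PyMuPDF releases: pdf._getXrefLength())
    lenXREF = pdf.xref_length()
    count = 1
    for i in range(1, lenXREF):
        # Source text of object i (older releases: pdf._getXrefString(i))
        text = pdf.xref_object(i)
        isImage = re.search(checkIM, text)
        if not isImage:
            continue
        pix = fitz.Pixmap(pdf, i)
        if pix.size < 10000:  # skip images below the size threshold
            continue
        new_name = f"img_{count}.png"
        # Older releases: pix.writePNG(...)
        pix.save(os.path.join(pic_path, new_name))
        count += 1
        pix = None  # release the pixmap

pdf2image1(file_path, dir_path)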
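Copyright notice: this is an original article by CSDN blogger 「小白^-」, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_46737755/article/details/113085763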
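
To see what the regular expression in pdf2image1 actually matches, here is a small standalone sketch. The object strings are hypothetical samples shaped like PDF object dictionaries, not output from a real file; only the pattern itself comes from the code above.

```python
import re

# Same pattern as in pdf2image1: "/Subtype" followed (lookahead) by optional spaces and "/Image"
CHECK_IM = r"/Subtype(?= */Image)"

# Hypothetical object definitions, for illustration only
samples = {
    "image": "<< /Type /XObject /Subtype /Image /Width 640 /Height 480 >>",
    "image_no_space": "<</Type/XObject/Subtype/Image/Width 640>>",
    "form": "<< /Type /XObject /Subtype /Form /BBox [0 0 100 100] >>",
    "font": "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
}

# True only for objects whose subtype is an image
is_image = {name: bool(re.search(CHECK_IM, text)) for name, text in samples.items()}
print(is_image)
```

Because the lookahead does not consume "/Image", the match itself is just "/Subtype"; the function only needs to know whether a match exists.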

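II. AI applications

1. OCR text recognition in images: easyocr

import easyocr

# Create a reader and choose the languages to recognize; 'ch_sim' and 'en' are Simplified Chinese and English
reader = easyocr.Reader(['ch_sim','en'])
# Read the text in the image
result = reader.readtext('example.png')
# Print the results one by one
for res in result:
    print(res)

Each printed result contains the bounding-box corner coordinates of the text, the recognized text, and the confidence: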


([[151, 101], [195, 101], [195, 149], [151, 149]], '好', 0.7816301184856115)
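
Since each entry is a (bounding box, text, confidence) tuple, the result can be unpacked and filtered with plain Python. A minimal sketch, using the sample tuple above plus one made-up low-confidence entry; the 0.5 threshold and the helper name keep_confident are assumptions for illustration, not part of easyocr.

```python
def keep_confident(results, min_conf=0.5):
    """Return (text, first corner) pairs whose confidence is at least min_conf."""
    kept = []
    for box, text, conf in results:
        if conf >= min_conf:
            # box[0] is the first corner of the box (the top-left one in the sample)
            kept.append((text, tuple(box[0])))
    return kept

sample = [
    ([[151, 101], [195, 101], [195, 149], [151, 149]], '好', 0.7816301184856115),
    ([[10, 10], [40, 10], [40, 30], [10, 30]], 'x', 0.12),  # made-up entry
]
print(keep_confident(sample))
```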

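2. OCR text recognition in images: paddleocr

from paddleocr import PaddleOCR

# Initialize PaddleOCR with the angle classifier enabled and GPU inference turned on
ocr = PaddleOCR(use_angle_cls=True, use_gpu=True)
text = ocr.ocr("example.png", cls=True)
# Print the recognized text; each entry is [box, (text, confidence)].
# Some newer releases wrap the entries in one list per image; in that case loop over text[0].
for t in text:
    print(t[1][0])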

3. Face recognition: face_recognition

Face comparison: check whether the faces in two images belong to the same person

import face_recognition

# Load the two images
known_image = face_recognition.load_image_file("lyf1.jpg")
unknown_image = face_recognition.load_image_file("lyf2.jpg")
# Compute an encoding for the first face found in each image
lyf_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encoding = face_recognition.face_encodings(unknown_image)[0]
# Compare the encodings
results = face_recognition.compare_faces([lyf_encoding], unknown_encoding)
print(results)
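
compare_faces returns one boolean per known encoding. Under the hood this is a distance check: the Euclidean distance between two 128-dimensional encodings is compared with a tolerance, 0.6 by default. A rough pure-Python sketch of that idea, using short made-up vectors rather than real encodings; the function name is_same_face is hypothetical.

```python
import math

def is_same_face(known_encodings, candidate, tolerance=0.6):
    """One boolean per known encoding: True if its Euclidean distance to candidate is within tolerance."""
    return [math.dist(known, candidate) <= tolerance for known in known_encodings]

# Made-up 3-dimensional vectors; real encodings have 128 values
known = [[0.1, 0.2, 0.3], [0.9, 0.9, 0.9]]
candidate = [0.1, 0.25, 0.35]
print(is_same_face(known, candidate))
```

A smaller tolerance makes the comparison stricter, at the cost of rejecting more genuine matches.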

Face detection: find every face in an image and draw a rectangle around each one

import face_recognition
import cv2

image = face_recognition.load_image_file("lyf1.jpg")
# model can be 'cnn' or the default 'hog'; hog is faster but less accurate
face_locations = face_recognition.face_locations(image, model='cnn')
# A list of tuples of found face locations in css (top, right, bottom, left) order

img = cv2.imread("lyf1.jpg")
cv2.imshow("lyf1.jpg", img)  # original image

# Draw a rectangle around each detected face
for i, loc in enumerate(face_locations):
    top, right, bottom, left = loc
    start = (left, top)
    end = (right, bottom)

    color = (0, 255, 255)
    thickness = 2
    cv2.rectangle(img, start, end, color, thickness)

cv2.imshow("face_recon", img)
cv2.waitKey(0)  # keep the windows open until a key is pressed
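
face_locations returns boxes as (top, right, bottom, left), while OpenCV drawing functions take (x, y) points. The conversion done inside the loop above can be isolated into small helpers. This is only a sketch: css_to_corners and box_size are made-up names, and the box is a made-up value.

```python
def css_to_corners(loc):
    """Convert (top, right, bottom, left) into ((x1, y1), (x2, y2)) corner points."""
    top, right, bottom, left = loc
    return (left, top), (right, bottom)

def box_size(loc):
    """Width and height in pixels of a (top, right, bottom, left) box."""
    top, right, bottom, left = loc
    return right - left, bottom - top

loc = (50, 220, 180, 90)  # made-up face location
print(css_to_corners(loc), box_size(loc))
```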

Facial landmark detection: locate key facial features such as the nose, eyes, and mouth

from PIL import Image, ImageDraw
import face_recognition
 
image = face_recognition.load_image_file("lyf1.jpg")
 
# Find all facial features in all the faces in the image
face_landmarks_list = face_recognition.face_landmarks(image)
 
# Create a PIL imagedraw object so we can draw on the picture
pil_image = Image.fromarray(image)
d = ImageDraw.Draw(pil_image)
 
for face_landmarks in face_landmarks_list:
 
  # Print the location of each facial feature in this image
  for facial_feature in face_landmarks.keys():
    print("The {} in this face has the following points: {}".format(facial_feature, face_landmarks[facial_feature]))
 
  # Let's trace out each facial feature in the image with a line!
  for facial_feature in face_landmarks.keys():
    d.line(face_landmarks[facial_feature], width=5)
 
# Show the picture
pil_image.show()

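4. Text segmentation: jieba

import jieba

# Accurate mode (the default)
res = jieba.cut('某个句子')
# Full mode
res = jieba.lcut('某个句子', cut_all=True)
# Search-engine mode
res = jieba.cut_for_search('某个句子')

# res now holds the search-engine-mode result
for item in res:
    print(item, end=' ')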

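5. A Chinese NLP toolkit for sentiment analysis, syntactic parsing, POS tagging, and information extraction: paddlenlp

Accurate entity extraction takes only three lines of code: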

from pprint import pprint
from paddlenlp import Taskflow
schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
ie = Taskflow('information_extraction', schema=schema)
pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌！")) # Better print results using pprint
>>>
[{'时间': [{'end': 6, 'probability': 0.9857378532924486, 'start': 0, 'text': '2月8日上午'}],
  '赛事名称': [{'end': 23, 'probability': 0.8503089953268272, 'start': 6, 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
  '选手': [{'end': 31, 'probability': 0.8981548639781138, 'start': 28, 'text': '谷爱凌'}]}]
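
The result is a list with one dict per input sentence, mapping each schema key to a list of candidate spans. A small post-processing sketch that keeps the highest-probability text for each key and checks that start and end are character offsets into the input; the helper name best_spans is made up, and the data is the output printed above.

```python
def best_spans(result):
    """For each schema key, keep the text of the span with the highest probability."""
    return {
        key: max(spans, key=lambda s: s['probability'])['text']
        for key, spans in result.items()
    }

sentence = "2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌！"
output = [{
    '时间': [{'end': 6, 'probability': 0.9857378532924486, 'start': 0, 'text': '2月8日上午'}],
    '赛事名称': [{'end': 23, 'probability': 0.8503089953268272, 'start': 6,
              'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
    '选手': [{'end': 31, 'probability': 0.8981548639781138, 'start': 28, 'text': '谷爱凌'}],
}]

print(best_spans(output[0]))

# start/end slice the original sentence back to the extracted text
span = output[0]['选手'][0]
print(sentence[span['start']:span['end']])
```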

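Two things make reasonably accurate open-domain information extraction possible:
(1) UIE, a technique published at ACL 2022 that unifies many information extraction subtasks in one framework and topped a number of information extraction leaderboards.
(2) ERNIE 3.0, a knowledge-enhanced language model.
Paper link: https://arxiv.org/pdf/2203.12277.pdf

paddlenlp also offers simple interfaces to several other capabilities: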

from paddlenlp import Taskflow

# Chinese Word Segmentation
seg = Taskflow("word_segmentation")
seg("第十四届全运会在西安举办")
>>> ['第十四届', '全运会', '在', '西安', '举办']

# POS Tagging
tag = Taskflow("pos_tagging")
tag("第十四届全运会在西安举办")
>>> [('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')]

# Named Entity Recognition
ner = Taskflow("ner")
ner("《孤女》是2010年九州出版社出版的小说，作者是余兼羽")
>>> [('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), ('，', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')]

# Dependency Parsing
ddp = Taskflow("dependency_parsing")
ddp("9月9日上午纳达尔在亚瑟·阿什球场击败俄罗斯球员梅德韦杰夫")
>>> [{'word': ['9月9日', '上午', '纳达尔', '在', '亚瑟·阿什球场', '击败', '俄罗斯', '球员', '梅德韦杰夫'],
      'head': [2, 6, 6, 5, 6, 0, 8, 9, 6],
      'deprel': ['ATT', 'ADV', 'SBV', 'MT', 'ADV', 'HED', 'ATT', 'ATT', 'VOB']}]

# Sentiment Analysis
senta = Taskflow("sentiment_analysis")
senta("这个产品用起来真的很流畅，我非常喜欢")
>>> [{'text': '这个产品用起来真的很流畅，我非常喜欢', 'label': 'positive', 'score': 0.9938690066337585}]
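
The dependency parsing output above is the least self-explanatory of these results. Judging from the sample, head appears to hold 1-based indices into word, with 0 marking the root. A sketch under that assumption that turns the parallel lists into readable triples; the helper name to_arcs is made up.

```python
parse = {
    'word': ['9月9日', '上午', '纳达尔', '在', '亚瑟·阿什球场', '击败', '俄罗斯', '球员', '梅德韦杰夫'],
    'head': [2, 6, 6, 5, 6, 0, 8, 9, 6],
    'deprel': ['ATT', 'ADV', 'SBV', 'MT', 'ADV', 'HED', 'ATT', 'ATT', 'VOB'],
}

def to_arcs(parse):
    """Turn parallel word/head/deprel lists into (dependent, relation, head word) triples."""
    words = parse['word']
    arcs = []
    for word, head, rel in zip(words, parse['head'], parse['deprel']):
        # Assumption: heads are 1-based and 0 means the sentence root
        head_word = 'ROOT' if head == 0 else words[head - 1]
        arcs.append((word, rel, head_word))
    return arcs

for arc in to_arcs(parse):
    print(arc)
```

Read this way, the verb 击败 is the root, 纳达尔 is its subject (SBV), and 梅德韦杰夫 is its object (VOB).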

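paddlenlp also supports few-shot custom training: adding a small number of training samples can noticeably improve recognition accuracy. For more information, see the official PaddleNLP site.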

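6. Automatic parsing of Chinese addresses: cpca

cpca completes an address and splits it into province, city, district, detailed address, and administrative division code:

import cpca

location_str = ["徐汇区虹漕路461号58号楼5楼",
                "泉州市洛江区万安塘西工业区",
                "北京朝阳区北苑华贸城"]

# Build a DataFrame with province, city, district, address, and administrative code columns
data = cpca.transform(location_str)

# For district names shared by several cities, use the umap parameter to pick one by its administrative code
data = cpca.transform(['朝阳区汉庭酒店'], umap={'朝阳区': 110105})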

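III. Performance optimization for data processing

1. Faster pandas reads and writes: feather

Storing data in binary feather files instead of CSV files makes reading and writing much faster.

Install: pip install feather-format

import pandas as pd

# Write the file (df is an existing DataFrame)
df.to_feather('example.feather')
# Read the file; you can load only specific columns
df = pd.read_feather('example.feather', columns=['col1','col2'])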

2. Binary serialization: pickle

Almost every Python data type (lists, dicts, sets, class instances, and so on) can be serialized with pickle.

import pickle

a = {'a': 1, 'b': 2}
with open('example.txt', 'wb') as f:
    pickle.dump(a, f)
with open('example.txt', 'rb') as f2:
    b = pickle.load(f2)
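
The same round trip also works in memory with pickle.dumps and pickle.loads, which is handy for caching or sending objects between processes. A small sketch using only built-in container types; the variable names are arbitrary. Note that unpickling can execute code embedded in the data, so only load pickles from sources you trust.

```python
import pickle

original = {'numbers': [1, 2, 3], 'tags': {'a', 'b'}, 'pair': (1.5, None)}

blob = pickle.dumps(original)   # serialize to a bytes object
restored = pickle.loads(blob)   # rebuild an equal but independent copy

print(type(blob).__name__, restored == original, restored is original)
```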

3. Speeding up numpy: numexpr

Using numexpr is simple: write the numpy expression as a string and pass it to numexpr's evaluate function:

import numexpr as ne
import numpy as np

a = np.linspace(0,1000,10000)
# Evaluate the numpy expression with numexpr
ne.evaluate('a**20')

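4. Speeding up numpy: cupy

cupy is a library that implements numpy-style arrays on NVIDIA GPUs through CUDA. A GPU has many CUDA cores, which allows far more parallelism than a CPU. It is used much like numpy.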

import cupy as cp

x_gpu = cp.ones((1000,1000,1000))
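
Before running the line above, it is worth estimating how much memory it needs. Assuming the default float64 dtype (the same default as numpy), a (1000, 1000, 1000) array takes about 8 GB, which is more than many GPUs have. A back-of-the-envelope check in plain Python; the helper name array_bytes is made up.

```python
import math

def array_bytes(shape, itemsize=8):
    """Bytes needed for a dense array: number of elements times bytes per element."""
    return math.prod(shape) * itemsize

shape = (1000, 1000, 1000)
print(array_bytes(shape) / 1e9, "GB as float64")              # 8 bytes per element
print(array_bytes(shape, itemsize=4) / 1e9, "GB as float32")  # 4 bytes per element
```

If the array does not fit, a smaller shape or a narrower dtype such as float32 reduces the footprint.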

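5. Optimizing Python numeric functions: numba

numba compiles Python functions to optimized machine code, with speeds that can approach C or Fortran. It is easy to use: just add a decorator in front of your function:

import numba as nb

@nb.jit
def nb_sum(a):
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total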

import numpy as np
a = np.linspace(0,1000,1000)
nb_sum(a)

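numba also supports GPU acceleration and vectorization, which can improve performance further.

import numba as nb
from numba import cuda

# Select a GPU by index (1 is the second device; use 0 on a single-GPU machine)
cuda.select_device(1)

@cuda.jit
def cudasquare(x):
    i, j = cuda.grid(2)
    # Threads that fall outside the array bounds do nothing
    if i < x.shape[0] and j < x.shape[1]:
        x[i, j] *= x[i, j]

# Vectorization
from math import sin

@nb.vectorize()
def nb_vec_sin(a):
    return sin(a)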

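6. Speeding up pandas: swifter

swifter is a pandas plugin that works directly on pandas objects. It checks whether a computation can be vectorized or run in parallel and uses that to improve performance.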

import pandas as pd
import swifter

df.swifter.apply(lambda x:x.mean())

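7. Speeding up pandas: modin

modin parallelizes pandas reading and computation, which gives a good speedup on larger datasets. It is easy to use: change the import, then use it the same way as pandas.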

import modin.pandas as pd
df = pd.concat([df,df2,df3])

References:
https://mp.weixin.qq.com/s/0CbaLTHGsGbD3hVN5Rcn0A
https://mp.weixin.qq.com/s/-XPGXEl0tRmjKq48z7P-7g
https://blog.csdn.net/weixin_46737755/article/details/113085763
https://blog.csdn.net/juzicode00/article/details/122243330
https://www.jianshu.com/p/7953beff8ca3
https://www.jb51.net/article/182659.htm
https://mp.weixin.qq.com/s/E-xZlwz4Ag00paEb-xwluw
