Today I chatted with our company's designer about kepler.gl, Uber's open-source geospatial data visualization tool built on deck.gl. The results are stunning!

https://uber.github.io/kepler.gl/#/demo

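While trying it out, I found that this client-side web app can render sample data of 20M+ in a very short time. Front-end frameworks have clearly become very powerful; JS rules the world (tongue in cheek).

Below are some of the public datasets that kepler.gl provides, collected here for future reference.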

  1. Open Data Paris: The site of the open data approach of the City of Paris. You will find here all the datasets published by the services of the City and its partners.
  2. California Earthquakes: The dataset contains a list of 2.5+ magnitude earthquakes in California. Information was generated using the USGS website and contains multiple properties (location, magnitude, magtype) for each single entry.
  3. New York City cab rides: The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
  4. San Francisco elevation contour: Physical Features – Elevation contours for San Francisco mainland and Treasure Island/Yerba Island. Based on San Francisco Elevation Datum.
  5. New York City population by census tract: This dataset contains the 2010 Census tract map, joined with population data of NYC.
  6. San Francisco street tree map: From DataSF, this dataset contains a list of DPW-maintained street trees including planting date, species and location.
  7. England and Wales Commute Map: This dataset shows location of residence and place of work, based on the 2011 Census of residence of England and Wales. The data classifies people currently resident in each middle layer super output area, or higher area, and shows the movement between their area of residence and workplace.
  8. $2+ million homes built in Los Angeles since 2006: Valuation and property description for parcels on the Assessor’s annual secured assessment rolls 2006 through 2017. This dataset excludes Cross Reference Rolls (89xx-series AssessorID).
  9. Travel Times from Uber Movement: Uber Movement provides free and public access to travel times data derived from billions of Uber trips. Data is available under a Creative Commons Noncommercial Attribution License.
  10. 2017 Unemployment Rates for U.S. Counties: 2017 labor force information from the Bureau of Labor Statistics joined with county shapefiles from the Census Bureau.

Ref

[1] https://github.com/uber-web/kepler.gl-data
[2] https://github.com/uber/kepler.gl
[3] https://uber.github.io/kepler.gl/#/demo

the universe’s stellar tapestry of birth and destruction

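Owning a beautiful bow is the privilege of three-dimensional beings,

for only in this space can a knot stay knotted.

In four-dimensional space,

it was never tied at all.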

–MorningRocks

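Natural language processing has long been called the jewel in the crown of artificial intelligence. Semantic analysis means using NLP and machine learning methods to mine and learn the deeper concepts in text. Wikipedia explains it as follows: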

In machine learning, semantic analysis of a corpus is the task of building structures that approximate concepts from a large set of documents (or images).

0x00 The current state of semantic analysis in AIOps

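NLP in the AIOps field, semantic analysis included, faces quite a few challenges, for example:

  • Questions are highly domain-specific
  • User intent is complex
  • Meaning depends strongly on context
  • Questions are highly varied
  • Referents are often missing
  • Language is heavily colloquial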

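In real application scenarios, all data comes from actual business communication. Some of the structured summary data from collaboration tools, email, and similar sources also carries the reasoning that business and operations engineers applied to that particular question-and-answer exchange. All of this makes it harder to put natural language processing into practice.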

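0x01 Word segmentation and building a language model

After receiving a piece of text, the first step is usually word segmentation. The more commonly used segmentation methods are:

  1. Segmentation based on string matching. This method scans the text in some order and looks up each candidate in a dictionary. Depending on the scanning strategy it can be subdivided into forward maximum matching, backward maximum matching, bidirectional maximum matching, shortest-path (minimum word count) segmentation, and so on.
  2. Full segmentation. It first enumerates every possible word that matches the dictionary, then uses a statistical language model to decide the best segmentation. Its advantage is that it can resolve segmentation ambiguity.
  3. Character-based word construction. This can be understood as a classification problem over characters, that is, the sequence labeling problem in NLP. The usual approach uses HMM, MaxEnt, MEMM, CRF, or similar models to predict a tag for each character of the text. Because a CRF can take all kinds of domain features, as a maximum entropy model can, while avoiding the HMM's homogeneous Markov assumption, CRF-based segmentation currently gives the best results among these. Besides models such as HMM and CRF, segmentation can also be done with deep learning methods.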
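As an illustration of the first method, here is a minimal sketch of forward and backward maximum matching. The tiny dictionary `VOCAB` and the function names are made up for this example and do not come from any particular segmentation library; unknown single characters are simply emitted as one-character tokens.

```python
# Toy dictionary, assumed for illustration only.
VOCAB = {"研究", "研究生", "生命", "的", "起源"}


def forward_max_match(text, vocab):
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is accepted even if it is not in the dictionary.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab):
    """Scan right to left, always taking the longest dictionary word."""
    max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


if __name__ == "__main__":
    sentence = "研究生命的起源"
    print(forward_max_match(sentence, VOCAB))
    print(backward_max_match(sentence, VOCAB))
```

On this sentence the two scan directions disagree (the forward pass greedily takes 研究生, the backward pass keeps 生命 intact), which is the kind of ambiguity that bidirectional matching and the language-model-based full segmentation in method 2 try to resolve.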

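A language model is a probabilistic model for computing the probability that a sentence is generated, that is, \(P(w_1, w_2, \ldots, w_m)\), where \(w_i\) is the \(i\)-th word and \(m\) is the total number of words.

According to Bayes' formula: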

\(P(A|B)=\frac{P(B|A)P(A)}{P(B)}\)
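The excerpt stops at this point, so what follows is only a sketch of one standard way to continue, not necessarily the author's. Applying the definition of conditional probability repeatedly gives the chain rule \(P(w_1,\ldots,w_m)=\prod_{i=1}^{m}P(w_i \mid w_1,\ldots,w_{i-1})\), and a bigram model approximates each factor by \(P(w_i \mid w_{i-1})\). The toy corpus, the `<s>`/`</s>` boundary markers, and the function names below are assumptions made for illustration, and the estimate uses plain counts with no smoothing.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over pre-segmented sentences."""
    contexts = Counter()
    bigrams = Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent pairs
    return contexts, bigrams


def sentence_probability(words, contexts, bigrams):
    """Maximum-likelihood bigram probability of one segmented sentence."""
    padded = ["<s>"] + list(words) + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


if __name__ == "__main__":
    corpus = [["我", "爱", "自然", "语言"], ["我", "爱", "机器", "学习"]]
    contexts, bigrams = train_bigram(corpus)
    print(sentence_probability(["我", "爱", "自然", "语言"], contexts, bigrams))
    print(sentence_probability(["我", "学习"], contexts, bigrams))
```

Because nothing is smoothed, any sentence containing an unseen word pair gets probability zero; a practical model would add smoothing or back-off.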


A partial solar eclipse over Ross Lake by Bill Ingalls/NASA

A still more glorious dawn awaits.

Not a sunrise, but a galaxy rise.

A morning filled with 400 billion suns.

The rising of the milky way.

— Carl Sagan, Cosmos